The EEPROM cache file is not refreshed after installing a new ONIE version, even if the EEPROM data has been updated. The current Eeprom class always tries to read from the cache file when the file exists. This PR fixes that.
As the new hw-mgmt exposes the sysfs for PSU fan max speed, we need to support max/min speed for the PSU fan in the Mellanox platform API.
Conflicts:
platform/mellanox/mlnx-platform-api/sonic_platform/fan.py
The new driver supports fetching additional pages from the cable EEPROM.
There is additional information to parse now: RX/TX power, TX bias, TX fault and RX LOS.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
Update SDK 4.4.1956 and FW *.2008.1956
Bug fixes:
1. Link | Clear operational speed when link is not active
2. Spectrum-2, SN3800 | On rare occasion, link flapping due to bad BER causes traffic loss
3. Spectrum-3 | On rare occasion, link flapping due to bad BER causes traffic loss as a result of new PAM4 link maintenance flow on Spectrum-3 devices
4. Shared Buffers | On rare occasion, modifying shared buffers on a system with split port while traffic is running may cause the firmware to get stuck
5. Spectrum-3, SN4700 | Fence may fail while running 400GbE 8x port when modifying mirror session configurations under traffic
Why/How I did:
Make sure the first error syslog is triggered based on the FAULT TOLERANCE condition.
Added support for a repeat clause with the alert action. This is used as a trigger
for generating periodic syslog error messages if the error is persistent.
Updated the monit conf files with repeat every x cycles for the alert action.
When detecting a new SFP insertion, read its SFP type and DOM capability from EEPROM again.
An SFP object is initialized to a certain type even if no SFP is present. A case could be:
1. An SFP object is initialized to QSFP type by default when there is no SFP present
2. The user inserts an SFP with an adapter into this QSFP port
3. The SFP object fails to read the EEPROM because it still treats itself as QSFP.
This PR fixes this issue.
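
The scenario above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Sfp` class, the `read_eeprom_byte0` callable and the `on_insert` hook are hypothetical names, and only the module type (not the DOM capability) is refreshed. The identifier values are the SFF-8024 codes for SFP and QSFP family modules.

```python
QSFP_TYPE = "QSFP"
SFP_TYPE = "SFP"

# Identifier byte (byte 0 of the module EEPROM) as defined by SFF-8024.
IDENTIFIER_TO_TYPE = {
    0x03: SFP_TYPE,   # SFP/SFP+/SFP28
    0x0C: QSFP_TYPE,  # QSFP
    0x0D: QSFP_TYPE,  # QSFP+
    0x11: QSFP_TYPE,  # QSFP28
}


class Sfp:
    """Hypothetical SFP object that re-detects its type on insertion."""

    def __init__(self, read_eeprom_byte0):
        # read_eeprom_byte0 is an assumed callable returning the identifier
        # byte as an int, or None when no module is present.
        self._read_byte0 = read_eeprom_byte0
        self.sfp_type = QSFP_TYPE  # default when nothing is plugged in
        self.refresh_type()

    def refresh_type(self):
        ident = self._read_byte0()
        if ident is not None:
            # Keep the current type if the identifier is not recognized.
            self.sfp_type = IDENTIFIER_TO_TYPE.get(ident, self.sfp_type)

    def on_insert(self):
        # Called when an insertion event is detected for this port.
        self.refresh_type()
```

Without the `refresh_type()` call in `on_insert`, the object would keep the default QSFP type after an SFP is plugged in through an adapter.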
Example of syslog message from Mellanox SAI:
"Oct 7 15:39:11.482315 arc-switch1025 INFO syncd#supervisord: syncd Oct 07 15:39:11 NOTICE SAI_BUFFER: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:1"
There is an INFO log from supervisord which actually printed NOTICE and
the date again. This confusion happens because, if SAI is not built to log
to syslog, it logs everything to stdout with the format "[date] [level]
[message]", so supervisord sends it to syslog with level INFO.
New logs look like:
"Oct 7 15:40:21.488055 arc-switch1025 NOTICE syncd#SDK [SAI_BUFFER]: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:17"
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
We now read the base MAC and product name from the EEPROM data, and the data read from the EEPROM contains multiple "\0" characters at the end. They need to be trimmed so that the strings are clean and displayed correctly.
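
A minimal sketch of the trimming, assuming the TLV value arrives as raw bytes; `clean_tlv_string` is a hypothetical helper name and not necessarily how the platform code does it:

```python
def clean_tlv_string(raw):
    """Decode an EEPROM TLV value and drop trailing NUL padding."""
    # errors="replace" keeps the call from raising on unexpected bytes.
    return raw.decode("ascii", errors="replace").rstrip("\x00")


product_name = clean_tlv_string(b"MSN2700\x00\x00\x00")
```

Only trailing NUL characters are removed, so any other content in the value is left untouched.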
- SN3800 vs Cisco9236 - no link copper or optics - start sending IDLE before PHY_UP for specific OPNs
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
**- Why I did it**
- The platform API implementation uses sonic-cfggen to get the platform name and SKU name, which fails when the database is not available.
- The chassis name is not correctly assigned; it should be assigned from the EEPROM TLV "Product Name" instead of the SKU name.
- The chassis model is not implemented; it should be assigned from the EEPROM TLV "Part Number".
**- How I did it**
1. Chassis
> - Get platform name from /host/machine.conf
> - Remove getting the SKU name with sonic-cfggen
> - Get Chassis name and model from EEPROM TLV "Product Name" and "Part Number"
> - Add function to return model
2. EEPROM
> - Add function to return product name and part number
3. Platform
> - Init EEPROM on the host side, so the Chassis name and model can also be obtained from EEPROM on the host side.
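
Reading the platform name from /host/machine.conf without the database can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, and the `onie_platform` key is assumed to be the one holding the platform name on ONIE-installed systems.

```python
def parse_machine_conf(text):
    """Parse key=value lines, skipping blanks, comments and malformed lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def get_platform_name(text):
    # Assumption: the platform name is stored under "onie_platform".
    return parse_machine_conf(text).get("onie_platform")


# Sample content for illustration only; on a device the text would be
# read from /host/machine.conf.
sample = "onie_version=2019.11\nonie_platform=x86_64-mlnx_msn2700-r0\n"
platform = get_platform_name(sample)
```

Because this only reads a file, it works before the database is available.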
Refactor the SFP reset and low power get/set APIs, and the plugins, with the new SDK SX APIs. Previously they called SDK SXD APIs, which have a glibc dependency because of shared memory usage.
Remove the implementations of "set_power_override", "tx_disable_channel" and "tx_disable", which use SXD APIs. Once the related SDK SX APIs are available, they will be added back based on the new SDK SX APIs.
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
We want Monit to unmonitor the processes in containers which are disabled in the `FEATURE` table so that
Monit will not generate false alert messages in the syslog.
- Backport of https://github.com/Azure/sonic-buildimage/pull/5153 to the 201911 branch
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
As part of consolidating all common Python-based functionality into the new sonic-py-common package, this pull request:
1. Redirects all Python applications/scripts in sonic-buildimage repo which previously imported sonic_device_util or sonic_daemon_base to instead import sonic-py-common, which was added to the 201911 branch in https://github.com/Azure/sonic-buildimage/pull/5063
2. Replaces all calls to `sonic_device_util.get_platform_info()` with calls to `sonic_py_common.get_platform()` and removes any calls to `sonic_device_util.get_machine_info()` which are no longer necessary (i.e., those which were only used to pass the results to `sonic_device_util.get_platform_info()`).
3. Removes unused imports of the now-deprecated sonic-daemon-base package and sonic_device_util.py module
This is a step toward resolving https://github.com/Azure/sonic-buildimage/issues/4999
Consolidate common SONiC Python-language functionality into one shared package (sonic-py-common) and eliminate duplicate code.
The package currently includes four modules:
- daemon_base
- device_info
- logger
- task_base
NOTE: This is a combination of all changes from https://github.com/Azure/sonic-buildimage/pull/5003, https://github.com/Azure/sonic-buildimage/pull/5049 and some changes from https://github.com/Azure/sonic-buildimage/pull/5043 backported to align with the 201911 branch. As part of the 201911 port, I am not installing the Python 3 package in the base image or in the VS container, because we do not have pip3 installed, and we do not intend to migrate to Python 3 in 201911.
SAI:
Fix ECMP max groups logic
Add set ISSU log level for spc2/spc3, as ISSU is now supported
Set vlan max swid = 0 on SDK init, as only a single swid is needed, for efficient resource usage
Fix traffic loss during FFB related to buffer config + optimize buffer config timing for FB
Add ACL fields BTH, IP flags
Add ACL infrastructure of different fields per ASIC type
Add port stat ether rx/tx oversize pkts
SDK/FW:
Added support for Finisar 100GbE SWDM Transceiver FTLC9152RGPL.
Spectrum-2 Added support for 10G BaseT modules
Added link LED support for SN4600C.
Counters | In SDK debug dump, the incorrect counter type appears for vtraps.
WJH | Without any traffic or events on the idle system, the CPU load is constantly above 4%
WJH | WJH filter currently cannot filter by PORT for buffer drop reason.
Spectrum | ACL, Unbind, Lazy Delete | Running Lazy Delete together with auto_unbind may cause race condition errors. To work with Lazy Delete, use the new INIT parameter "acl_manual_unbind" so that ACLs will not be removed automatically when the binding point is deleted.
Spectrum | ISSU | In ISSU mode, when querying for the number of configurable buffers, using the API sx_api_cos_port_buff_type_get with the count parameter as 0, the API returns the number for NORMAL mode instead.
Spectrum-2 | BER | BER monitor counts raw errors instead of effective errors
Spectrum-2 | BER | Connecting to ConnectX-5 adapter card with copper splitter cable MCP7H50-V001R30 in 1
Spectrum-2 | Cables | Link flaps in 200GbE with AOM Optic cable MMA1T00-VS
Spectrum-3 | Speeds, Link | When moving from a 400GbE link to a 1GbE link, packets may drop for 1msec right after link up
Spectrum-3 | Cables, Speeds | Using 400GbE with 3rd party systems is not supported
Spectrum-3 | LAG | After a while, LAG members become out of sync with one another
Spectrum-3 | VLAN, Ports | Packets with VLAN headers are sent to
Align SFP key names with new standard defined in https://github.com/Azure/sonic-platform-common/pull/97
- hardwarerev -> hardware_rev
- serialnum -> serial
- manufacturename -> manufacturer
- modelname -> model
- Connector -> connector
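
The renames listed above can be expressed as a simple mapping. This is a sketch for illustration; `KEY_RENAMES` and `align_keys` are hypothetical names, and keys not in the list are passed through unchanged.

```python
# Old key name -> new key name, taken from the list above.
KEY_RENAMES = {
    "hardwarerev": "hardware_rev",
    "serialnum": "serial",
    "manufacturename": "manufacturer",
    "modelname": "model",
    "Connector": "connector",
}


def align_keys(info):
    """Return a copy of info with the old key names replaced."""
    return {KEY_RENAMES.get(key, key): value for key, value in info.items()}


aligned = align_keys({"serialnum": "MT1234X56789", "type": "QSFP28"})
```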
**- Why I did it**
After discussing with Joe, we use the string "/usr/bin/syncd\s" in the Monit configuration file to monitor
the syncd process on Broadcom and Mellanox. Due to my carelessness, I did not find this bug during the
previous testing. If we use the string "/usr/bin/syncd" in the Monit configuration file to monitor the
syncd process, Monit cannot detect whether the syncd process is running.
If we run the command `sudo monit procmatch "/usr/bin/syncd"` on Broadcom, three processes in the syncd container match "/usr/bin/syncd": `/bin/bash /usr/bin/syncd.sh wait`, `/usr/bin/dsserve /usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile` and `/usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile`. Monit selects the process with the highest uptime (here `/bin/bash /usr/bin/syncd.sh wait`) and does not select `/usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile`.
Similarly, on Mellanox Monit also selects the process with the highest uptime (here `/bin/bash /usr/bin/syncd.sh wait`) and does not select `/usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile`.
That is why Monit is unable to detect whether the syncd process is running if we use the string "/usr/bin/syncd" in the Monit configuration file. If we use the string "/usr/bin/syncd\s" instead, Monit filters out the process `/bin/bash /usr/bin/syncd.sh wait` and can therefore monitor the syncd process correctly.
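
The effect of the trailing `\s` can be checked with a short sketch. Python's `re` module is used here only as a stand-in for Monit's own pattern matching, and the command lines are the three quoted in the description above (with plain ASCII dashes).

```python
import re

CMDLINES = [
    "/bin/bash /usr/bin/syncd.sh wait",
    "/usr/bin/dsserve /usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile",
    "/usr/bin/syncd --diag -u -p /etc/sai.d/sai.profile",
]


def matching(pattern):
    """Return the command lines that contain a match for pattern."""
    return [cmd for cmd in CMDLINES if re.search(pattern, cmd)]


# Without the trailing \s the wrapper script syncd.sh also matches.
loose = matching(r"/usr/bin/syncd")
# Requiring whitespace after "syncd" excludes "/usr/bin/syncd.sh".
strict = matching(r"/usr/bin/syncd\s")
```

With the stricter pattern, the long-running `syncd.sh wait` wrapper is no longer a candidate, which is the behavior the description relies on.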
**- How I did it**
**- How to verify it**
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
* Change port index in port_config.ini to 1-based
* Add default port index to port_config.ini, change platform plugins to accept 1-based port index
* Fix port index in sfp_event.py