** Why I did it **
Disable SDK extended dump due to issue found
** How I did it **
Update SAI submodule
** How to verify it **
Verify the SDK extended dump is not called.
Signed-off-by: Eran Dahan <erand@nvidia.com>
when parallel build is enabled, both docker-fpm-frr and docker-syncd-brcm
is built at the same time, docker-fpm-frr requires swss which requires to
install libsaivs-dev. docker-syncd-brcm requires syncd package which requires
to install libsaibcm-dev.
since libsaivs-dev and libsaibcm-dev install the sai header in the same
location, these two packages cannot be installed at the same time. Therefore,
we need to serialize the build between these two packages. Simply uninstall
the conflict package is not enough to solve this issue. The correct solution
is to have one package wait for another package to be uninstalled.
For example, if syncd is built first, then it will install libsaibcm-dev.
Meanwhile, if the swss build job starts and tries to install libsaivs-dev,
it will first try to query if libsaibcm-dev is installed or not. if it is
installed, then it will wait until libsaibcm-dev is uninstalled. After syncd
job is finished, it will uninstall libsaibcm-dev and swss build job will be
unblocked.
To solve this issue, _UNINSTALLS is introduced to uninstall a package that
is no longer needed and to allow blocked job to continue.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
backport c4b5b002c3
make swss build depends only on libsairedis instead of syncd. This allows to build swss without depending
on vendor sai library.
Currently, libsairedis build also buils syncd which requires vendor SAI lib. This makes difficult to build
swss docker in buster while still keeping syncd docker in stretch, as swss requires libsairedis which also
build syncd and requires vendor to provide SAI for buster. As swss docker does not really contain syncd
binary, so it is not necessary to build syncd for swss docker.
[submodule]: update sonic-sairedis
1e42517996bfe41ac58d4c25ee3f93502befcb9d (HEAD -> 201911) [build]: add option to build without syncd
Signed-off-by: Guohan Lu <lguohan@gmail.com>
In order to prevent "mlxsw_minimal" driver accessing ASIC during in
service firmware upgrade flow, SDK will raise "OFFLINE" 'udev' event
at early beginning of such flow. When this event is received,
hw-managemnet will remove "mlxsw_minimal" driver.
There is no need to implement opposite "ONLINE" event, since this flow
is ended up with "kexec".
Signed-off-by: Kebo Liu <kebol@nvidia.com>
Bugs fixes:
All | Kernel | During system reload when CPU is loaded with heavy traffic, a Kernel Panic may occur.
All | Modules, Port split | FW stuck when device rebooted with locked Optical Transceivers in split mode
Spectrum-3 | PFC | On Spectrum-3 systems, slow reaction time to Rx pause packets on 40GbE ports may lead to buffer overflow on servers.
Spectrum-3 | SN4700, Port Split | On rare occasion SN4700, conducting 100G split (4x25G) in NRZ when splitter port 1 or 2 are down, ports 3 and 4 will also go down.
Enahncments:
All | Kernel | new notification on ISSU start, so other kernel drivers can disable any interface to ASIC
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
Fix setting PSU led to 'green' or 'red' states.
Fix return False if unsupported color request.
Remove 'off' option for PSU led API since it is not supported in Mellanox.
- How I did it
Fix import missing information.
Return 'False' when unsupported led color is requested, preventing an exception.
- How to verify it
Try to set PSU LED to different status with Mellanox platform device.
Try to set PSU LED color to unsupported color with Mellanox platform device.
EEPROM cache file is not refreshed after install a new ONIE version even if the eeprom data is updated. The current Eeprom class always try to read from the cache file when the file exists. The PR is aimed to fix it.
As new hw-mgmt expose the sysfs for PSU fan max speed, we need support max/min speed for PSU fan in mellanox platform API.
Conflicts:
platform/mellanox/mlnx-platform-api/sonic_platform/fan.py
New driver support fetching additional pages from the cable EEPROM.
There are additional information to parse now: RX/TX power, TX bias, TX fault and RX LOS.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
Update SDK 4.4.1956 and FW *.2008.1956
Bugs fixes:
1. Link | Clear operational speed when link is not active
2. Spectrum-2, SN3800 | On rare occasion, link flapping due to bad BER causes traffic loss
3. Spectrum-3 | On rare occasion, link flapping due to bad BER causes traffic loss as a result of new PAM4 link maintenance flow on Spectrum-3 devices
4. Shared Buffers | On rare occasion, modifying shared buffers on a system with split port while traffic is running may cause the firmware to get stuck
5. Spectrum-3, SN4700 | Fence may fail while running 400GbE 8x port when modifying mirror session configurations under traffic
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
When detecting a new SFP insertion, read its SFP type and DOM capability from EEPROM again.
SFP object will be initialized to a certain type even if no SFP present. A case could be:
1. A SFP object is initialized to QSFP type by default when there is no SFP present
2. User insert a SFP with an adapter to this QSFP port
3. The SFP object fail to read EEPROM because it still treats itself as QSFP.
This PR fixes this issue.
Example of syslog message from Mellanox SAI:
"Oct 7 15:39:11.482315 arc-switch1025 INFO syncd#supervisord: syncd Oct 07 15:39:11 NOTICE SAI_BUFFER: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:1"
There is a log INFO from supervisord which actually printed NOTICE and
date again. This confusion happens becuase if SAI is not built to log
to syslog it will log everything to stdout with format "[date] [level]
[message]" so supervisord sends it to syslog with level INFO.
New logs look like:
"Oct 7 15:40:21.488055 arc-switch1025 NOTICE syncd#SDK [SAI_BUFFER]: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:17"
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Now we are reading base mac, product name from eeprom data, and the data read from eeprom contains multiple "\0" characters at the end, need trim them to make the string clean and display correct.
- SN3800 vs Cisco9236 - no link copper or optics - start sending IDLE before PHY_UP for specific OPNs
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
**- Why I did it**
- Platform API implementation using sonic-cfggen to get platform name and SKU name, which will fail when the database is not available.
- Chassis name is not correctly assigned, it shall be assigned with EEPROM TLV "Product Name", instead of SKU name
- Chassis model is not implemented, it shall be assigned with EEPROM TLV "Part Number"
**- How I did it**
1. Chassis
> - Get platform name from /host/machine.conf
> - Remove get SKU name with sonic-cfggen
> - Get Chassis name and model from EEPROM TLV "Product Name" and "Part Number"
> - Add function to return model
2. EEPROM
> - Add function to return product name and part number
3. Platform
> - Init EEPROM on the host side, so also can get the Chassis name model from EEPROM on the host side.
Refactor SFP reset, low power get/set API, and plugins with new SDK SX APIs. Previously they were calling SDK SXD APIs which have glibc dependency because of shared memory usage.
Remove implementation "set_power_override", "tx_disable_channel", "tx_disable" which using SXD APIs, once related SDK SX API available, will add them back based on new SDK SX APIs.
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
- Backport of https://github.com/Azure/sonic-buildimage/pull/5153 to the 201911 branch
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
As part of consolidating all common Python-based functionality into the new sonic-py-common package, this pull request:
1. Redirects all Python applications/scripts in sonic-buildimage repo which previously imported sonic_device_util or sonic_daemon_base to instead import sonic-py-common, which was added to the 201911 branch in https://github.com/Azure/sonic-buildimage/pull/5063
2. Replaces all calls to `sonic_device_util.get_platform_info()` to instead call `sonic_py_common.get_platform()` and removes any calls to `sonic_device_util.get_machine_info()` which are no longer necessary (i.e., those which were only used to pass the results to `sonic_device_util.get_platform_info()`.
3. Removes unused imports to the now-deprecated sonic-daemon-base package and sonic_device_util.py module
This is a step toward resolving https://github.com/Azure/sonic-buildimage/issues/4999
Consolidate common SONiC Python-language functionality into one shared package (sonic-py-common) and eliminate duplicate code.
The package currently includes four modules:
- daemon_base
- device_info
- logger
- task_base
NOTE: This is a combination of all changes from https://github.com/Azure/sonic-buildimage/pull/5003, https://github.com/Azure/sonic-buildimage/pull/5049 and some changes from https://github.com/Azure/sonic-buildimage/pull/5043 backported to align with the 201911 branch. As part of the 201911 port, I am not installing the Python 3 package in the base image or in the VS container, because we do not have pip3 installed, and we do not intend to migrate to Python 3 in 201911.
SAI:
Fix ECMP max groups logic
add set issu log level for spc2/spc3, as now issu is supported
set vlan max swid = 0 on sdk init, as only single swid is needed, for efficient resource usage
Fix traffic lost during FFB related to buffer config + optimize buffer config timing for FB
Add ACL fields BTH, IP flags
Add ACL infrastructure of different fields per ASIC type
Add port stat ether rx/tx oversize pkts
SDK/FW:
Added support for Finisar 100GbE SWDM Transceiver FTLC9152RGPL.
Spectrum-2 Added support for 10G BaseT modules
Added link LED support for SN4600C.
Counters | In SDK debug dump, the incorrect counter type appears for vtraps.
WJH | Without any traffic or events on the idle system, the CPU load is constantly above 4%
WJH | WJH filter currently cannot filter by PORT for buffer drop reason.
Spectrum | ACL, Unbind, Lazy Delete | Running Lazy Delete together with auto_unbind may cause rate condition errors. To work work with Lazy Delete use new INIT parameter "acl_manual_unbind" so that ACLs will notbe removed automatically when binding point is deleted.
Spectrum | ISSU | In ISSU mode, when querying for the number of configurable buffers, using the API sx_api_cos_port_buff_type_get with the count parameter as 0, the API returns the number for NORMAL mode instead.
Spectrum-2 | BER | BER monitor counts raw errors instead of effective errors
Spectrum-2 | BER | Connecting to ConnectX-5 adapter card with copper splitter cable MCP7H50-V001R30 in 1
Spectrum-2 | Cables | Link flaps in 200GbE with AOM Optic cable MMA1T00-VS
Spectrum-3 | Speeds, Link | When moving from a 400GbE link to a 1GbE link, packets may drop for 1msec right after link up
Spectrum-3 | Cables, Speeds | Using 400GbE with 3rd party systems is not supported
Spectrum-3 | LAG | After a while, LAG members become out of sync with one another
Spectrum-3 | VLAN, Ports | Packets with VLAN headers are sent to
Align SFP key names with new standard defined in https://github.com/Azure/sonic-platform-common/pull/97
- hardwarerev -> hardware_rev
- serialnum -> serial
- manufacturename -> manufacturer
- modelname -> model
- Connector -> connector