Submodule commits included:
* src/sonic-platform-common 6ad0004...bd4dc03 (1):
> [sonic_sfp/qsfp_dd.py] Update DOM capability method name to align with other drivers (#163)
Also align all calling function names to match.
When Building syncd-rpc, libthrift has dependency on libboost-atomic1.71.0,
however the debian packager install version 1.67 instead. This PR
preinstalls libboost-atomic v 1.71 to avoid falling back to v 1.67.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
After migrating to python3, the operator '/' always get a float result, but it gets integer result in python2. Need fix this in thermal_conditions.
**- How I did it**
1. cast float value to int
2. change the unit test case to cover this situation
**- How to verify it**
Manually test and regression test
- Apply device MAC on port host interface when port is removed from LAG.
- [Shared Headroom]: fixed watermark handling for SHP flow
- Decrease verbosity of policer unbind message when no policer is attached
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
**- Why I did it**
PR https://github.com/Azure/sonic-platform-common/pull/102 modified the name of the SFF-8436 (QSFP) method to align the method name between all drivers, renaming it from `parse_qsfp_dom_capability` to `parse_dom_capability`. Once the submodule was updated, the callers using the old nomenclature broke. This PR updates all callers to use the new naming convention.
**- How I did it**
Update the name of the function globally for all calls into the SFF-8436 driver.
Note that the QSFP-DD driver still uses the old nomenclature and should be modified similarly. I will open a PR to handle this separately.
**- Why I did it**
Reduce the time it takes for the ASIC FW burn as part of the automatic FW upgrade procedure.
**- How I did it**
Add -d option to mlxfwmanager tool to use the faster MST device and not the default one which is not the fastest one.
**- How to verify it**
I manually changed ASIC FW followed by reboot command in order for FW upgrade to take place on deinit.
I manually changed ASIC FW followed by hard reset in order for FW upgrade to take place on init.
Signed-off-by: liora <liora@nvidia.com>
**Why I did it**
Disable SDK extended dump due to issue found
**How I did it**
Update SAI submodule
**How to verify it**
Verify the SDK extended dump is not called.
Signed-off-by: Eran Dahan <erand@nvidia.com>
- Why I did it
Fix issue: ptf_nn_agent isn't able to start in syncd-rpc docker on buster.
- How I did it
The issue is fixed by installing python-dev, cffi and nnpy for python 2 explicitly.
- How to verify it
Run copp test on RPC image.
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
Changes in the new release:
1. Policy based hashing optimization
2. New attribute support for Max port headroom
3. Tunnel ECN map fixes
4. Tunnel EVPN skeleton extensions (peer attrib, maps)
5. Bridge port admin not affecting port admin (optimize port down time)
6. CRM new API for neighbors and tunnel termination entries
7. Improve FDB event for flush by bridge port (before, null bridge was reported to SONiC, now the bridge will be extracted from bridge port)
8. DHCP L2 v4+v6 traps (for ZTP use case)
9. Generic counter implementation
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms
Signed-off-by: Guohan Lu <lguohan@gmail.com>
During ISSU, "mlxsw_minimal" driver still trying to access firmware, in some cases FW could return some wrong critical threshold value which will cause switch shutdown.
**- How I did it**
In order to prevent "mlxsw_minimal" driver from accessing ASIC during ISSU, SDK will raise "OFFLINE" 'udev' event
at the early beginning of such flow. When this event is received, hw-management will remove "mlxsw_minimal" driver.
There is no need to implement the opposite "ONLINE" event since this flow is ended up with "kexec".
**- How to verify it**
repeatedly perform warm reboot, make sure there is no switch shutdown occurred.
Bugs fixes:
All | Kernel | During system reload when CPU is loaded with heavy traffic, a Kernel Panic may occur.
All | Modules, Port split | FW stuck when device rebooted with locked Optical Transceivers in split mode
Spectrum-3 | PFC | On Spectrum-3 systems, slow reaction time to Rx pause packets on 40GbE ports may lead to buffer overflow on servers.
Spectrum-3 | SN4700, Port Split | On rare occasion SN4700, conducting 100G split (4x25G) in NRZ when splitter port 1 or 2 are down, ports 3 and 4 will also go down.
Enahncments:
All | Kernel | new notification on ISSU start, so other kernel drivers can disable any interface to ASIC
Signed-off-by: Kebo Liu <kebol@nvidia.com>
**- Why I did it**
On the Mellanox platform, reboot cause is fetched from some certain sysfs which is created by the hw-management service. So determine-reboot-cause service shall start after hw-management, otherwise it could fail due to the related sysfs is not available yet.
**- How I did it**
Add a patch to the hw-management service to make sure determine-reboot-cause service should start after it.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
In rare case can see that xcvrd failed due to "UnboundLocalError: local variable 'label_port' referenced before assignment"
Init "label_port" as None at the beginning of the function, to avoid the case that "label_port" not assigned.
Features:
Spectrum-3 | Systems | Added GA-level support for SN4700 A0 system
All | Shared headroom | Added GA-level support for Shared headroom between PGs
Bugs fixes:
All | Counters | Sent traffic in certain size is wrongly increase to a smaller size counter, because port extended counter has a counter for sent traffic per packet-size range
All | Shared buffer | Configuring shared buffer on the fly may, on occasion, cause the chip to get stuck
Spectrum-2 | Modules | On occasion, link down is experienced with INPHI COLORZ PAM4 100G optic cables on SN3700 systems
* [Mellanox] Update SAI to 1.18.0
* [Mellanox] Update SDK to 4.4.2112
* Updated Mellanox SAI to 1.18.0.2
* Updated bcmsai debians to use SAI 1.7.1
* Updated Mellanox to use SAI 1.7.1
* Updated submodule sonic-sairedis using SAI 1.7.1
Co-authored-by: Vineet Mittal <vmittalmittal@microsoft.com>
Co-authored-by: Nazarii Hnydyn <nazariig@nvidia.com>
**- Why I did it**
To support dynamic buffer calculation.
This PR also depends on the following PRs for sub modules
- [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338)
- [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361)
- [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973)
**- How I did it**
1. Introduce field `buffer_model` in `DEVICE_METADATA|localhost` to represent which buffer model is running in the system currently:
- `dynamic` for the dynamic buffer calculation model
- `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used
2. Add the tables required for the feature:
- ASIC_TABLE in platform/\<vendor\>/asic_table.j2
- PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2
- PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed.
- DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2
- Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2
3. Copy the newly introduced j2 files into the image and rendering them when the system starts
4. Update the CLI options for buffermgrd so that it can start with dynamic mode
5. Fetches the ASIC vendor name in orchagent:
- fetch the vendor name when creates the docker and pass it as a docker environment variable
- `buffermgrd` can use this passed-in variable
6. Clear buffer related tables from STATE_DB when swss docker starts
7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2
8. Remove buffer pool sizes for ingress pools and egress_lossy_pool
Update the buffer settings for dynamic buffer calculation
python2 is end of life and SONiC is going to support python3. This PR is going to support:
1. Mellanox SONiC platform API python3 support
2. Install both python2 and python3 verson of Mellanox SONiC platform API or pmon and host side
EEPROM cache file is not refreshed after install a new ONIE version even if the eeprom data is updated. The current Eeprom class always try to read from the cache file when the file exists. The PR is aimed to fix it.
Submodule updates include the following commits:
* src/sonic-utilities 9dc58ea...f9eb739 (18):
> Remove unnecessary calls to str.encode() now that the package is Python 3; Fix deprecation warning (#1260)
> [generate_dump] Ignoring file/directory not found Errors (#1201)
> Fixed porstat rate and util issues (#1140)
> fix error: interface counters is mismatch after warm-reboot (#1099)
> Remove unnecessary calls to str.decode() now that the package is Python 3 (#1255)
> [acl-loader] Make list sorting compliant with Python 3 (#1257)
> Replace hard-coded fast-reboot with variable. And some typo corrections (#1254)
> [configlet][portconfig] Remove calls to dict.has_key() which is not available in Python 3 (#1247)
> Remove unnecessary conversions to list() and calls to dict.keys() (#1243)
> Clean up LGTM alerts (#1239)
> Add 'requests' as install dependency in setup.py (#1240)
> Convert to Python 3 (#1128)
> Fix mock SonicV2Connector in python3: use decode_responses mode so caller code will be the same as python2 (#1238)
> [tests] Do not trim from PATH if we did not append to it; Clean up/fix shebangs in scripts (#1233)
> Updates to bgp config and show commands with BGP_INTERNAL_NEIGHBOR table (#1224)
> [cli]: NAT show commands newline issue after migrated to Python3 (#1204)
> [doc]: Update Command-Reference.md (#1231)
> Added 'import sys' in feature.py file (#1232)
* src/sonic-py-swsssdk 9d9f0c6...1664be9 (2):
> Fix: no need to decode() after redis client scan, so it will work for both python2 and python3 (#96)
> FieldValueMap `contains`(`in`) will also work when migrated to libswsscommon(C++ with SWIG wrapper) (#94)
- Also fix Python 3-related issues:
- Use integer (floor) division in config_samples.py (sonic-config-engine)
- Replace print statement with print function in eeprom.py plugin for x86_64-kvm_x86_64-r0 platform
- Update all platform plugins to be compatible with both Python 2 and Python 3
- Remove shebangs from plugins files which are not intended to be executable
- Replace tabs with spaces in Python plugin files and fix alignment, because Python 3 is more strict
- Remove trailing whitespace from plugins files
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
In order to support SONiC physical entity mib extension, a few new platform API are added to sonic-platform-common, this PR is to provide an mellanox platform implementation for those new APIs.
pick up new functions and bug fixes:
- New Features
- Add dynamic minimum tables for MSN3700X, MSN3800, MSN3420, MSN4600, MSN4700 systems
- Split hw-management to one-shot init hw-management service and thermal control services.
- Bug fixes
HW Mgmt core:
- Move PSU EEPROM configuration from kernel to user space for Spectrum 2 / Spectrum 3 system
In case of non-GA SDK version there is '-' symbol in Mellanox SDK version name. (For example: 4.4.1306-006)
In appropriate .deb packet there is '.' instead of '-'. Because of this there was problem while building SDK
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
New driver support fetching additional pages from the cable EEPROM.
There are additional information to parse now: RX/TX power, TX bias, TX fault and RX LOS.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
When detecting a new SFP insertion, read its SFP type and DOM capability from EEPROM again.
SFP object will be initialized to a certain type even if no SFP present. A case could be:
1. A SFP object is initialized to QSFP type by default when there is no SFP present
2. User insert a SFP with an adapter to this QSFP port
3. The SFP object fail to read EEPROM because it still treats itself as QSFP.
This PR fixes this issue.
Now we are reading base mac, product name from eeprom data, and the data read from eeprom contains multiple "\0" characters at the end, need trim them to make the string clean and display correct.
The `get_serial_number()` method in the ChassisBase and ModuleBase classes was redundant, as the `get_serial()` method is inherited from the DeviceBase class. This method was removed from the base classes in sonic-platform-common and the submodule was updated in https://github.com/Azure/sonic-buildimage/pull/5625.
This PR aligns the existing vendor platform API implementations to remove the `get_serial_number()` methods and ensure the `get_serial()` methods are implemented, if they weren't previously.
Note that this PR does not modify the Dell platform API implementations, as this will be handled as part of https://github.com/Azure/sonic-buildimage/pull/5609
Example of syslog message from Mellanox SAI:
"Oct 7 15:39:11.482315 arc-switch1025 INFO syncd#supervisord: syncd Oct 07 15:39:11 NOTICE SAI_BUFFER: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:1"
There is a log INFO from supervisord which actually printed NOTICE and
date again. This confusion happens becuase if SAI is not built to log
to syslog it will log everything to stdout with format "[date] [level]
[message]" so supervisord sends it to syslog with level INFO.
New logs look like:
"Oct 7 15:40:21.488055 arc-switch1025 NOTICE syncd#SDK [SAI_BUFFER]: mlnx_sai_buffer.c[3893]- mlnx_clear_buffer_pool_stats: Clear pool stats pool id:17"
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
- SN3800 vs Cisco9236 - no link copper or optics - start sending IDLE before PHY_UP for specific OPNs
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
**- Why I did it**
- Platform API implementation using sonic-cfggen to get platform name and SKU name, which will fail when the database is not available.
- Chassis name is not correctly assigned, it shall be assigned with EEPROM TLV "Product Name", instead of SKU name
- Chassis model is not implemented, it shall be assigned with EEPROM TLV "Part Number"
**- How I did it**
1. Chassis
> - Get platform name from /host/machine.conf
> - Remove get SKU name with sonic-cfggen
> - Get Chassis name and model from EEPROM TLV "Product Name" and "Part Number"
> - Add function to return model
2. EEPROM
> - Add function to return product name and part number
3. Platform
> - Init EEPROM on the host side, so also can get the Chassis name model from EEPROM on the host side.
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
We are moving toward building all Python packages for SONiC as wheel packages rather than Debian packages. This will also allow us to more easily transition to Python 3.
Python files are now packaged in "sonic-utilities" Pyhton wheel. Data files are now packaged in "sonic-utilities-data" Debian package.
**- How I did it**
- Build and install sonic-utilities as a Python package
- Remove explicit installation of wheel dependencies, as these will now get installed implicitly by pip when installing sonic-utilities as a wheel
- Build and install new sonic-utilities-data package to install data files required by sonic-utilities applications
- Update all references to sonic-utilities scripts/entrypoints to either reference the new /usr/local/bin/ location or remove absolute path entirely where applicable
Submodule updates:
* src/sonic-utilities aa27dd9...2244d7b (5):
> Support building sonic-utilities as a Python wheel package instead of a Debian package (#1122)
> [consutil] Display remote device name in show command (#1120)
> [vrf] fix check state_db error when vrf moving (#1119)
> [consutil] Fix issue where the ConfigDBConnector's reference is missing (#1117)
> Update to make config load/reload backward compatible. (#1115)
* src/sonic-ztp dd025bc...911d622 (1):
> Update paths to reflect new sonic-utilities install location, /usr/local/bin/ (#19)
Refactor SFP reset, low power get/set API, and plugins with new SDK SX APIs. Previously they were calling SDK SXD APIs which have glibc dependency because of shared memory usage.
Remove implementation "set_power_override", "tx_disable_channel", "tx_disable" which using SXD APIs, once related SDK SX API available, will add them back based on new SDK SX APIs.
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future.
- Also remove some references to sonic-daemon-base which I previously missed and add missing sonic-py-common dependency for sonic-pcied.
As part of consolidating all common Python-based functionality into the new sonic-py-common package, this pull request:
1. Redirects all Python applications/scripts in sonic-buildimage repo which previously imported sonic_device_util or sonic_daemon_base to instead import sonic-py-common, which was added in https://github.com/Azure/sonic-buildimage/pull/5003
2. Replaces all calls to `sonic_device_util.get_platform_info()` to instead call `sonic_py_common.get_platform()` and removes any calls to `sonic_device_util.get_machine_info()` which are no longer necessary (i.e., those which were only used to pass the results to `sonic_device_util.get_platform_info()`.
3. Removes unused imports to the now-deprecated sonic-daemon-base package and sonic_device_util.py module
This is the next step toward resolving https://github.com/Azure/sonic-buildimage/issues/4999
Also reverted my previous change in which device_info.get_platform() would first try obtaining the platform ID string from Config DB and fall back to gathering it from machine.conf upon failure because this function is called by sonic-cfggen before the data is in the DB, in which case, the db_connect() call will hang indefinitely, which was not the behavior I expected. As of now, the function will always reference machine.conf.
when parallel build is enabled, both docker-fpm-frr and docker-syncd-brcm
is built at the same time, docker-fpm-frr requires swss which requires to
install libsaivs-dev. docker-syncd-brcm requires syncd package which requires
to install libsaibcm-dev.
since libsaivs-dev and libsaibcm-dev install the sai header in the same
location, these two packages cannot be installed at the same time. Therefore,
we need to serialize the build between these two packages. Simply uninstall
the conflict package is not enough to solve this issue. The correct solution
is to have one package wait for another package to be uninstalled.
For example, if syncd is built first, then it will install libsaibcm-dev.
Meanwhile, if the swss build job starts and tries to install libsaivs-dev,
it will first try to query if libsaibcm-dev is installed or not. if it is
installed, then it will wait until libsaibcm-dev is uninstalled. After syncd
job is finished, it will uninstall libsaibcm-dev and swss build job will be
unblocked.
To solve this issue, _UNINSTALLS is introduced to uninstall a package that
is no longer needed and to allow blocked job to continue.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Consolidate common SONiC Python-language functionality into one shared package (sonic-py-common) and eliminate duplicate code.
The package currently includes three modules:
- daemon_base
- device_info
- logger
Align SFP key names with new standard defined in https://github.com/Azure/sonic-platform-common/pull/97
- hardwarerev -> hardware_rev
- serialnum -> serial
- manufacturename -> manufacturer
- modelname -> model
- Connector -> connector
System health feature needs to set/get system led status
- Add a led object in chassis class and initialize it when the API is called on host side
- Read/write system led system fs to get/set the status
make swss build depends only on libsairedis instead of syncd. This allows to build swss without depending
on vendor sai library.
Currently, libsairedis build also buils syncd which requires vendor SAI lib. This makes difficult to build
swss docker in buster while still keeping syncd docker in stretch, as swss requires libsairedis which also
build syncd and requires vendor to provide SAI for buster. As swss docker does not really contain syncd
binary, so it is not necessary to build syncd for swss docker.
* [submodule]: update sonic-sairedis
* ccbb3bc 2020-06-28 | add option to build without syncd (HEAD, origin/master, origin/HEAD) [Guohan Lu]
* 4247481 2020-06-28 | install saidiscovery into syncd package [Guohan Lu]
* 61b8e8e 2020-06-26 | Revert "sonic-sairedis: Add support to sonic-sairedis for gearbox phys (#624)" (#630) [Danny Allen]
* 85e543c 2020-06-26 | add a README to tests directory to describe how to run 'make check' (#629) [Syd Logan]
* 2772f15 2020-06-26 | sonic-sairedis: Add support to sonic-sairedis for gearbox phys (#624) [Syd Logan]
Signed-off-by: Guohan Lu <lguohan@gmail.com>
1. Upgrade SAI headers to v1.6.3
2. Fix traffic lost during FFB related to buffer config + optimize buffer config timing for FB
3. Add ACL fields BTH, IP flags
4. Add ACL infrastructure of different fields per ASIC type
**- Why I did it**
System health feature requires to read ASIC temperature and threshold from platform API
**- How I did it**
Implement Chassis.get_asic_temperature and Chassis.get_asic_temperature_threshold by getting value from system fs.
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
**- Why I did it**
After discussed with Joe, we use the string "/usr/bin/syncd\s" in Monit configuration file to monitor
syncd process on Broadcom and Mellanox. Due to my careless, I did not find this bug during the
previous testing. If we use the string "/usr/bin/syncd" in Monit configuration file to monitor the
syncd process, Monit will not detect whether syncd process is running or not.
If we ran the command `sudo monit procmactch “/usr/bin/syncd”` on Broadcom, there will be three
processes in syncd container which matched this "/usr/bin/syncd": `/bin/bash /usr/bin/syncd.sh
wait`, `/usr/bin/dsserve /usr/bin/syncd –diag -u -p /etc/sai.d/sai.profile` and `/usr/bin/syncd –diag -
u -p /etc/sai.d/said.profile`. Monit will select the processes with the highest uptime (at there
`/bin/bash /usr/bin/syncd.sh wait`) to match and did not select `/usr/bin/syncd –diag -u -p
/etc/sai.d/said.profile` to match.
Similarly, On Mellanox Monit will also select the process with the highest uptime (at there
`/bin/bash /usr/bin/syncd.sh wait`) to match and did not select `/usr/bin/syncd –diag -u -p
/etc/sai.d/said.profile` to match.
That is why Monit is unable to detect whether syncd process is running or not if we use the string “/usr/bin/syncd” in Monit configuration file. If we use the string "/usr/bin/syncd\s" in Monit configuration file, Monit can filter out the process `/bin/bash /usr/bin/syncd.sh wait` and thus can correctly monitor the syncd process.
**- How I did it**
**- How to verify it**
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
* Change port index in port_config.ini to 1-based
* Add default port index to port_config.ini, change platform plugins to accept 1-based port index
* fix port index in sfp_event.py
* Update sonic-sairedis (sairedis with SAI 1.6 headers)
* Update SAIBCM to 3.7.4.2, which is built upon SAI1.6 headers
* missed updating BRCM_SAI variable, fixed it
* Update SAIBCM to 3.7.4.2, updated link to libsaibcm
* [Mellanox] Update SAI (release:v1.16.3; API:v1.6)
Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
* Update sonic-sairedis pointer to include SAI1.6 headers
* [Mellanox] Update SDK to 4.4.0914 and FW to xx.2007.1112 to match SAI 1.16.3 (API:v1.6)
Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
* ensure the veth link is up in docker VS container
* ensure the veth link is up in docker VS container
* [Mellanox] Update SAI (release:v1.16.3.2; API:v1.6)
Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
* use 'config interface startup' instead of using ifconfig command, also undid the previous change'
Co-authored-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
* Trigger thermal action log only if thermal condition changes
* test file existence before read file content
* fix error for set psu fan speed
* Remove logs because it print too frequently