Commit Graph

134 Commits

Author SHA1 Message Date
abdosi
0355caf20b Added support to add gbsyncd in Feature Table of Host Config DB (#11754)
Why I did:

In case of multi-asic platforms gbsyncd is not getting added to Feature Table of Host Config DB. Without this container_checker complains of not needed gbsyncd container's are running.

How I did:
Update Both Host and Namespace config db when gbsyncd docker is starting.

How I verify:
Verified on Multi-asic platforms.
2022-08-19 15:22:12 +00:00
Nikola Dancejic
f63dc738f9 [swss] Adding conditional for bgp when on multi ASIC platform (#11691)
bgp should be a per-asic service, and runs for each namespace on
multi-asic platforms. However, putting bgp in MULTI_INST_DEPENDENT
causes swss to be restarted as well as bgp. this is causing issues after #11000

Issue: #11653

This fix:

removes bgp from dependents list
adds a conditional that either adds bgp, or bgp@$DEV to separate
between single and multi-asic platforms
2022-08-17 17:10:29 +00:00
Lawrence Lee
15c80b207c [arp_update]: Resolve failed neighbors on dualtor (#11615)
In arp_update, check for FAILED or INCOMPLETE kernel neighbor entries and manually ping them to try and resolve the neighbor

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-08-11 16:19:25 +00:00
Stepan Blyshchak
3201dc93f6 [swss.sh/syncd.sh] Trap only on EXIT (#11590)
When using trap on SIGTERM the script will not react to the SIGTERM signal sent while a child is executing.
I.e, the following script does not react on SIGTERM sent to it if it is
waiting for sleep to finish:

```

trap "echo Handled SIGTERM" 0 2 3 15

echo "Before sleep"
sleep inf
echo "After sleep"
```

Instead, trap only on EXIT which covers also a scenario with exit on
SIGINT, SIGTERM.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-11 16:18:00 +00:00
Ying Xie
094745f06f [write_standby] update write_standby.py script (#11650)
Why I did it
The initial value has to be present for the state machines to work. In active-standby dual-tor scenario, or any hardware mux scenario, the value will be updtaed eventually with a delay.

However, in active-active dual-tor scenario, there is no other mechanism to initialize the value and get state machines started.
So this script will have to write something at start up time.

For active-active dualtor, 'active' is a more preferred initial value, the state machine will switch the state to standby soon if
link prober found link not in good state.

How I did it
Update the script to always provide initial values.

How to verify it
Tested on active-active dual-tor testbed.

Signed-off-by: Ying Xie ying.xie@microsoft.com
2022-08-09 23:02:09 +00:00
Stepan Blyshchak
29d29b9491 [swss.sh] clear counters cache folder on swss cold/fast reload (#11244)
A change in sonic-utilities makes all cache files be saved into a
/tmp/cache. On swss restart this cache has to be removed in case swss
starts in cold or fast mode. A related cache restoration in the warmboot
finalizer script is also updated to use new location.

- Why I did it
To fix #9817. Clear the cache directory on swss.sh except for warm start.
Also, adopted finalize-warmboot script to take the cache directory.

- How I did it
A change in sonic-utilities makes all cache files be saved into a /tmp/cache. On swss restart this cache has to be removed in case swss starts in cold or fast mode. A related cache restoration in the warmboot finalizer script is also updated to use new location.

- How to verify it
Run togather with Azure/sonic-utilities#2232. Verify counters cache is removed on config reload, cold/fast reboots, swss restart.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-08-08 20:42:54 +00:00
Nikola Dancejic
32fb4c7772 [swss] Adding bgp container as dependent of swss (#11000)
What I did:
Added bgp as a dependent of swss

Why I did it:
bgp container was not restarting on swss crash. When swss crashes, linkmgrd
doesn't initate a switchover because it cannot access the default route from
orchagent. Bringing down bgp with swss will isolate the ToR, causing linkmgrd
to initiate a switchover to the peer ToR avoiding significant packet loss.

How I did it:
Added bgp to DEPENDENT

Signed-off-by: Nikola Dancejic <ndancejic@microsoft.com>
2022-08-08 20:40:35 +00:00
abdosi
eb56dc8b90 Enable ARP Update Script for Packet based chassis. (#11465)
What I did:

    Following changes done for packet based chassis:-
    1> Run arp_update on LC's to resolve static route nexthops over backend
    port-channel interfaces.
    2> On Supervisor make sure arp_update exit gracefully
2022-07-28 20:36:54 +00:00
Jing Zhang
f25a84c148 Avoid write_standby in warm restart context (#11283)
Avoid write_standby in warm restart context.

sign-off: Jing Zhang zhangjing@microsoft.com

Why I did it
In warm restart context, we should avoid mux state change.

How I did it
Check warm restart flag before applying changes to app db.

How to verify it
Ran write_standby in table missing, key missing, field missing scenarios.
Did a warm restart, app db changes were skipped. Saw this in syslog:
WARNING write_standby: Taking no action due to ongoing warmrestart.
2022-06-30 05:15:41 +00:00
Sudharsan Dhamal Gopalarathnam
379d77af42 [lldp]Fix lldp spawned after reboot when disabled (#11080)
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.

- How I did it
In syncd.sh check if the service is enabled before spawning automatically during cold reboot.

- How to verify it
Disable lldp feature. Perform cold reboot and verify its not spawned.
2022-06-22 23:08:05 +00:00
shlomibitton
323aa791ec [Mellanox] [pmon] Fix for PMON service not starting when restarting SWSS service after fast/warm reboot (#10901)
- Why I did it
Recent change to delay PMON service in case of fast/warm reboot introduce an issue when restarting only SWSS service after fast/warm reboot for Nvidia platform.
Since the timer is triggered only when the system boot, in a scenario when the system is after a fast/warm reboot and the user restart SWSS service, as part of syncd.sh script, PMON service will stop but the timer will not start again.

- How I did it
On syncd.sh script, in case of fast/warm indication, check if pmon.timer is running.
If it is running it means we are at the first boot and continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.

- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2022-06-17 03:31:18 +00:00
judyjoseph
8fc5c9b31f Cleanup macsec stateDB tables on restart (#11066)
Clean macsec tables in STATE_DB on start
2022-06-16 02:12:59 +00:00
Lukas Stockner
c9b27cde71
[swss] Clear VXLAN tunnel table from State DB on startup (#10822)
* When reloading config after crashes, VTEP interfaces are sometimes not created since the tunnel still exists in the STATE_DB.
* Adding VXLAN_TUNNEL_TABLE to the list of tables to be cleaned in swss.sh fixes the problem.
2022-05-31 08:54:31 -07:00
shlomibitton
4ec3af86af
[Fastboot] Delay PMON service for better fastboot performance (#10567)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, PMON is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for PMON service.
Exclude for MLNX platform the start trigger of PMON when SYNCD starts in case of fastboot.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
2022-05-02 10:44:17 +03:00
shlomibitton
1d84e0d7df
[Fastboot] Delay LLDP service for better fastboot performance (#10568)
- Why I did it
Profiling the system state on init after fast-reboot during create_switch function execution, it is possible to see few python scripts running at the same time.
This parallel execution consume CPU time and the duration of create_switch is longer than it should be.
Following this finding, and the motivation to ensure these services will not interfere in the future, LLDP is delayed in 90 seconds until the system finish the init flow after fastboot.

- How I did it
Add a timer for LLDP service.
Copy the timer file to the host bin image.

- How to verify it
Run fast-reboot on MLNX platform and observe faster create_switch execution time.
This PR is dependent on PR: #10567
2022-04-28 10:35:14 +03:00
Junhua Zhai
128d762af3
[gearbox] Add peer gbsyncd for swss if gearbox exists (#10504)
Fix the issues #10501 and #9733

If having gearbox, we need:
    * add gbsyncd as a peer since swss also has dependency on gbsyncd
    * add service gbsyncd to FEATURE table if it is missing
2022-04-20 19:02:49 +08:00
Kostiantyn Yarovyi
bf4ab4a338
[Barefoot][Syncd] restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC (#9941)
Why I did it
improvement of starting barefoot SDK

How I did it
restart of the interface for cleaning txquee through which communication takes place between Sonic and openBMC

How to verify it
run sonic autorestart tests
2022-03-21 10:07:20 -07:00
Stepan Blyshchak
18d00dfbe7
[teamd.sh] kill teamd docker on warm shutdown for faster shutdown (#10219)
This can save 6 sec for teamd LAG restoration - the time between:

```
Mar  9 13:51:10.467757 r-panther-13 WARNING teamd#teamd_PortChannel1[28]: Got SIGUSR1.
Mar  9 13:52:33.310707 r-panther-13 INFO teamd#teamd_PortChannel1[27]: carrier changed to UP
```

- Why I did it
Optimize warm boot. Specifically reduce the time needed for LAG restoration.

- How I did it
Kill teamd docker after graceful shutdown of teamd processes.

- How to verify it
Run warm reboot.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
2022-03-15 09:20:36 +02:00
Lawrence Lee
a50d1f1fc8
[write_standby]: Increase timeout to 60s (#10065)
- Avoid scenarios where script times out before orchagent can establish IPinIP tunnel

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2022-02-24 14:55:45 -08:00
tbgowda
4e32f85a31
Enable SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL attribute (#9419)
Why I did it
Fixes #8980 partly.

The corresponding changes in sonic-sairedis is here :
Azure/sonic-sairedis#975

How I did it
Include changes from both repos and build an image for verification.

How to verify it
Trigger fast-reboot with the changes, see the attribute SAI_SWITCH_ATTR_UNINIT_DATA_PLANE_ON_REMOVAL being set at the SAI level.

Signed-off-by: Thushar Gowda <24815472+tbgowda@users.noreply.github.com>
2022-02-01 08:44:17 -08:00
Shi Su
4b357044b3
[bgpcfgd] Add bgpcfgd support to advertise routes (#9197)
Why I did it
Add bgpcfgd support to advertise routes.

How I did it
Make bgpcfgd subscribe to the ADVERTISE_NETWORK table in STATE_DB and configure route advertisement accordingly.

How to verify it
Added unit tests in bgpcfgd and verify on KVM about route advertisement.
2021-11-29 23:17:57 -08:00
Lawrence Lee
6e1a477ce0
[mux]: Fix mark_dhcp_packet (#9373)
- Consolidate the two [Service] sections by moving the ExecStartPre line for mark_dhcp_packet.py to the first section and removing the second.
- Make the mark_dhcp_packet.py file executable
- Also clean up mark_dhcp_packet.py
    - Remove unused imports
    - Fix spacing and line lengths to conform to PEP8
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-11-29 12:04:06 -08:00
Brian O'Connor
002827f08e
[PINS] Add APPL_STATE_DB and response path log (#9082)
- Add APPL_STATE_DB to database_config.json
- Clear APPL_STATE_DB during SwSS container restarts
- Add response path log file to logrotate config: responsepublisher.rec

Co-authored-by: PINS Working Group <sonic-pins-subgroup@googlegroups.com>
2021-11-24 10:31:06 -08:00
Junhua Zhai
240596ec7d
[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9332)
Why I did it
Fix #9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.

How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
2021-11-23 10:44:29 -08:00
Guohan Lu
f3faf6111b Revert "[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9286)"
This reverts commit 1d2a11bbb8.
2021-11-19 10:10:55 -08:00
Junhua Zhai
1d2a11bbb8
[gearbox] provide common gbsyncd.service.j2 to start for platform specific gbsyncd docker (#9286)
Why I did it
Fix #9059. It provides common gbsyncd.service.j2 to start for platform specific gbsyncd docker, which must be named 'gbsyncd'.

How I did it
All of platform specific gbsyncd dockers use a common name 'gbsyncd'
Use a unique systemd service template gbsyncd.service.j2 for gbsyncd docker
2021-11-17 23:49:49 -08:00
trzhang-msft
689c101095
update DHCP_PACKET_MARK schema (#9077)
- update DHCP_PACKET_MARK schema in state_db
- this is an update over PR: Add service mark_dhcp_packet to mux container #9015
2021-11-02 15:55:50 -07:00
Stepan Blyshchak
4ad5f2af3f
[swss.sh] fix an issue that dependent services are not read from a file (#8943)
This is due to the SERVICE variable declared after reading a file

#### Why I did it

To fix an issue that dhcp_relay does not restart with swss.

#### How I did it

Fixed in the swss.sh script

#### How to verify it

sudo systemctl restart swss
verify dhcp_relay restarts as well.
2021-10-26 19:01:30 -07:00
trzhang-msft
4e0c4fb832
Add service mark_dhcp_packet to mux container (#9015)
- add a new service "mark_dhcp_packet" to mux container
- apply packet marks on a per-interface basis in ebtables
- write packet marks to "DHCP_PACKET_MARK" table in state_db
2021-10-26 14:10:13 -07:00
Nazarii Hnydyn
453346f8df
[teamd]: Send USR1/USR2 only to subscribers. (#8856)
To fix teamd signal handling, without which Process 'tlm_teamd' exited unexpectedly
2021-10-26 09:12:07 -07:00
Sumukha Tumkur Vani
3971c20001
Flush RESTAPI_DB when config reload is performed (#9037) 2021-10-22 11:45:19 -07:00
Lawrence Lee
d5834fcb1b Merged PR 4679112: [write_standby]: Ignore non-auto interfaces
[write_standby]: Ignore non-auto interfaces

* In the event that `write_standby.py` is used to automatically switchover interfaces when linkmgrd or bgp crashes, ignore any interfaces that are not configured to auto-switch

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
17cbfc44e6 Merged PR 4559560: [bgp]: Switch to standby if BGP container exits
[bgp]: Switch mux to standby if BGP container exits

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
69bae5b27a [write_standby]: Improve logging
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
Lawrence Lee
5232647b33 [mux]: Make write_standby available on host
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>

[write_standby]: Cleanup and fix build

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-10-15 09:59:59 -07:00
byu343
50a9587e6e
[gbsyncd] Flush GB_ASIC_DB for gbsyncd cold restart (#8633)
This is to flush the state in GB_ASIC_DB when running 'config reload'. Otherwise, the left state affects the cold restart of gbsyncd.
2021-08-31 15:52:48 -07:00
Vladyslav Morokhovych
80e0627acc [swss] Fix arp_update script (#8412)
Fix #7968

Issue is detected on SONiC.20201231.11

In test_static_route.py::test_static_route_ecmp static routes are configured, but neighbors are not resolved after config reload even after 10 minutes.
It looks like the arp_update script is starting to ping when Vlan1000 is not fully configured.
When issue is reproduced, stuck ping6 process is observed in swss container :

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         180  0.1  0.0   6296  1272 pts/0    S    17:03   0:03 ping6 -I Vlan1000 -n -q -i 0 -c 1 -W 0 ff02::1
And when arp_update script successfully resolves neighbors, we observe sleep 300 instead of ping process
2021-08-12 23:29:22 -07:00
Longxiang Lyu
6283716e1d
[swss][arp_update] Send ipv6 pings over vlan sub interfaces (#8363)
#### Why I did it
* `arp_update` fails to ping those neighbors over vlan sub interfaces.

#### How I did it
* modify `arp_update_vars.j2` to get vlan sub interfaces with ipv6 addresses assigned.
* modify `arp_update` to send ipv6 pings over those retrieved vlan sub interfaces.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
2021-08-06 21:14:18 -07:00
byu343
2fccf0661a
[gearbox] Add gbsyncd container for Credo gearbox chips (#8144)
This change is to add a gbsyncd container to accommodate the syncd process and the SAI libraries for the Credo gearbox chips.

How I did it
This container works similar to the existing Broadcom syncd container. Its main difference is that the SAI-related dynamic libraries are replaced by the ones for Credo gearbox chips, and the container only reacts to SAI events for the gearbox chips. The SAI libraries will be provided by the package libsai-credo_1.0_amd64.deb.

For the image build, the added container will be built and included in the Broadcom platform image, after $(LIBSAI_CREDO)_URL = is replaced to the correct value. For now, as $(LIBSAI_CREDO)_URL is empty, the container build is skipped in the image build.

After the container is included in the image, in the runtime, the container will begin with checking the existence of /usr/share/sonic/hwsku/gearbox_config.json; if that file is not provided, the container will exit by itself. Therefore, for platforms unrelated to the Credo chips, as long as they are not providing the file, they will not be affected by this change.
2021-08-04 16:05:53 -07:00
mprabhu-nokia
3fd6e8d500
[systemd] ASIC status based service bringup on VOQ chassis (#7477)
Changes to allow starting per asic services like swss and syncd only if the platform vendor codedetects the asic is detected and notified. The systemd services ordering we want is database->database@->pmon->swss@->syncd@->teamd@->lldp@
There is also a requirement that management, telemetry, snmp dockers can start even if all asic services are not up.

Why I did it
For VOQ chassis, the fabric cards will have 1-N asics. Also, there could be multiple removable fabric cards. On the supervisor, swss and syncd containers need to be started only if the fabric-card is in Online state and respective asics are detected by the kernel. Using systemd, the dependent services can be in inactive state.

How I did it
Introduce a mechanism where all ASIC dependent service wait on its state to be published via PMON to REDIS. Once the subscription is received, the service proceeds to create respective dockers.
For fixed platforms, systemd is unchanged i.e. the service bring up and docker creation happens in the start()/ExecStartPre routine of the .sh scripts.
For VOQ chassis platform on supervisor, the service bringup skips docker creation in the start() routine, but does it in the wait()/ExecStart routine of the .sh scrips.
Management dockers are decoupled from ASIC docker creation.
2021-07-27 23:02:49 -07:00
Stepan Blyshchak
b3b6938fda
[dhcp-relay] make DHCP relay an extension (#6531)
- Why I did it
Make DHCP relay docker an extension. DHCP relay now carries dhcp relay commands CLI plugin and has a complete manifest.
It is installed as extension if INCLUDE_DHCP_REALY is set to y.

DEPENDS on #5939

- How I did it
Modify DHCP relay docker makefile and dockerfile. Make changes to sonic_debian_extension.j2 to install sonic packages.
I moved DHCP related CLI tests from sonic-utilities to DHCP relay docker.
This PR introduces a way to write a plugin as part of docker image and run the tests from cli-plugin-tests directory under docker directory.
The test result is available in target/docker-dhcp-relay.gz.log:

[ REASON ] :      target/docker-dhcp-relay.gz does not exist   NON-EXISTENT PREREQUISITES: docker-start target/docker-config-engine-buster.gz-load target/python-wheels/sonic_utilities-1.2-py3-none-any.whl-in
stall target/debs/buster/python3-swsscommon_1.0.0_amd64.deb-install
[ FLAGS  FILE    ] : []
[ FLAGS  DEPENDS ] : []
[ FLAGS  DIFF    ] : []
============================= test session starts ==============================
platform linux -- Python 3.7.3, pytest-3.10.1, py-1.7.0, pluggy-0.8.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sonic/dockers/docker-dhcp-relay/cli-plugin-tests, inifile:
plugins: cov-2.6.0
collecting ... collected 10 items

test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_plugin_registration PASSED [ 10%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_nonexist_vlanid PASSED [ 20%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_invalid_vlanid PASSED [ 30%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_invalid_ip PASSED [ 40%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_dhcp_relay_with_exist_ip PASSED [ 50%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_add_del_dhcp_relay_dest PASSED [ 60%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_remove_nonexist_dhcp_relay_dest PASSED [ 70%]
test_config_dhcp_relay.py::TestConfigVlanDhcpRelay::test_config_vlan_remove_dhcp_relay_dest_with_nonexist_vlanid PASSED [ 80%]
test_show_dhcp_relay.py::TestVlanDhcpRelay::test_plugin_registration PASSED [ 90%]
test_show_dhcp_relay.py::TestVlanDhcpRelay::test_dhcp_relay_column_output PASSED [100%]

=============================== warnings summary ===============================
/usr/local/lib/python3.7/dist-packages/tabulate.py:7
  /usr/local/lib/python3.7/dist-packages/tabulate.py:7: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from collections import namedtuple, Iterable

-- Docs: https://docs.pytest.org/en/latest/warnings.html
==================== 10 passed, 1 warnings in 0.35 seconds =====================
2021-07-15 10:35:56 -07:00
Stepan Blyshchak
9de7e6860b
[sonic-app-ext] support app extensions installation during build (#7593)
Signed-off-by: Stepan Blyschak stepanb@mellanox.com

Why I did it
To support building DHCP relay as extension and installing it during build time.

How I did it
Created infrastructure. Users need to define their packages in rules/sonic-packages.mk

How to verify it
Together with #6531
2021-06-29 09:07:33 -07:00
Prince Sunny
556a1dc9a8
[Mux] Do not clean-up HW_MUX_CABLE_TABLE from State DB (#7710)
Co-authored-by: Ubuntu <prsunny@prince-vm.vzw1i4tqyeburcdz5lrgulxi2c.yx.internal.cloudapp.net>
2021-05-26 09:12:34 -07:00
yozhao101
21f5e1280d
[Supervisord] Deduplicate the alerting messages of critical processes from Supervisord. (#6849)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
In the configuration of rsyslog, duplicate messages will be suppressed and reported in the format of message repeated n times.
Due to this behavior, if a critical process in a container exited unexpectedly, the alerting message will be written into syslog once
and not be written into syslog anymore until the second critical process exited. This PR aims to differentiate these alerting messages such that they will not be suppressed by rsyslogd and can appear in the syslog periodically.

How I did it
This PR adds a counter into the alerting message and shows how many minutes a critical process was not running.

How to verify it
I verified and test this implementation on a physical DUT.
2021-02-25 14:35:29 -08:00
shlomibitton
f6bee7306e
Stop teamd service before syncd (#6755)
- What I did
All SWSS dependent services should stop before SWSS service to avoid future possible issues.
For example 'teamd' service will stop before to allow the driver unload netdev gracefully.
This is to stop all LAG's before restarting syncd service when running 'config reload' command.

- How I did it
Change the order of dependent services of SWSS.

- How to verify it
Run 'config reload' command.
Previously the operation failed when a large number of PortChannel configured on the system.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-02-15 16:05:34 +02:00
Lawrence Lee
97c605f1f7
[swss]: Clear MUX-related state DB tables on start (#6759)
* Add *MUX_CABLE_TABLE* to set of tables to clear on SWSS start, which
will clear HW_MUX_CABLE_TABLE and MUX_CABLE_TABLE
* Order swss to start before pmon to ensure that DBs are cleared before
xcvrd (running inside pmon) starts and re-populates the tables

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-02-14 12:43:49 -08:00
Lior Avramov
6f8c31554f
[systemd] Increase syncd startup script timeout to support FW upgrade on init. (#6709)
**- Why I did it**
To support FW upgrade on init.

**- How I did it**
Change timeout value

**- How to verify it**
I manually changed ASIC and Gearbox FW followed by hard reset in order for FW upgrade to take place on init.

Signed-off-by: liora <liora@nvidia.com>
2021-02-11 12:53:36 +02:00
Guohan Lu
3f2a39d583 [proc-exit-listener]: fix syntax error
the bug is introduced in commit 34cca20c

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-02-02 03:58:20 -08:00
Guohan Lu
34cca20cb6 [proc-exit-listener]: ignore blank lines
make proc-exit-listener more rebust

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 19:41:59 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-21 12:57:49 -08:00