Commit Graph

187 Commits

Author SHA1 Message Date
Guohan Lu
3f2a39d583 [proc-exit-listener]: fix syntax error
the bug is introduced in commit 34cca20c

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-02-02 03:58:20 -08:00
Guohan Lu
34cca20cb6 [proc-exit-listener]: ignore blank lines
make proc-exit-listener more rebust

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-27 19:41:59 -08:00
yozhao101
be3c036794
[supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-21 12:57:49 -08:00
mprabhu-nokia
41012f791e
In modular chassis, add CHASSIS_STATE_DB on control card (#5624)
HLD: Azure/SONiC#646

In modular chassis, add CHASSIS_STATE_DB on control card

Why I did it
Modular Chassis has control-cards, line-cards and fabric-cards along with other peripherals. Control-Card CHASSIS_STATE_DB will be the central DB to maintain any state information of cards that is accessible to control-card/

How I did it
Adding another DB on an existing REDIS instance running on port 6380.
2020-12-15 17:15:00 -08:00
Stephen Sun
e010d83fc3
[Dynamic buffer calc] Support dynamic buffer calculation (#6194)
**- Why I did it**
To support dynamic buffer calculation.
This PR also depends on the following PRs for sub modules
- [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338)
- [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361)
- [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973)

**- How I did it**
1. Introduce field `buffer_model` in `DEVICE_METADATA|localhost` to represent which buffer model is running in the system currently:
    - `dynamic` for the dynamic buffer calculation model
    - `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used
2. Add the tables required for the feature:
   - ASIC_TABLE in platform/\<vendor\>/asic_table.j2
   - PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2
   - PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed.
   - DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2
   - Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2
3. Copy the newly introduced j2 files into the image and rendering them when the system starts
4. Update the CLI options for buffermgrd so that it can start with dynamic mode
5. Fetches the ASIC vendor name in orchagent:
   - fetch the vendor name when creates the docker and pass it as a docker environment variable
   - `buffermgrd` can use this passed-in variable
6. Clear buffer related tables from STATE_DB when swss docker starts
7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2
8. Remove buffer pool sizes for ingress pools and egress_lossy_pool
   Update the buffer settings for dynamic buffer calculation
2020-12-13 11:35:39 -08:00
Joe LeVeque
905a5127bb
[Python] Align files in root dir, dockers/ and files/ with PEP8 standards (#6109)
**- Why I did it**

Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.

**- How I did it**

Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
2020-12-03 15:57:50 -08:00
abdosi
fad481edc1
Enhanced Feature table to support 'always_enabled' value for state and auto-restart fields. (#6000)
Added new flag value 'always_enabled' for the state and auto-restart field of feature table

init_cfg.json is updated to initialize state field of database/swss/syncd/teamd feature and auto-restart field of database feature
as always_enabled

Once the state/auto-restart value is initialized as "always_enabled" it is immutable and cannot be change via feature config commands. (config feature..) PR#Azure/sonic-utilities#1271

hostcfgd will not take any action if state field value is 'always_enabled'

Since we have always_enabled field for auto-restart updated supervisor-proc-exit-listener
not to have special check for database and always rely on value from Feature table.
2020-11-25 08:41:11 -08:00
Joe LeVeque
7bf05f7f4f
[supervisor] Install vanilla package once again, install Python 3 version in Buster container (#5546)
**- Why I did it**

We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.

**- How I did it**

- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
    - Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
2020-11-19 23:41:32 -08:00
heidinet2007
7c17c58b83
Move teamd warm reboot code to service script (#5163)
Summary: Move teamd functions to a new service script

Motivation: To segregate teamd functions in one common place. fast-reboot script calls teamd functions that should ideally be replaced by a simple call to a service script.
 
Changes: New teamd service script and path modification from /usr/bin/teamd.sh to /usr/local/bin/teamd.sh
fast-reboot script (in sonic-utilities) modification (to use new teamd.sh to stop teamd) should follow soon after this change.

Verification: VS image tests.

Signed-off-by: Vaibhav Hemant Dixit <vaibhav.dixit@microsoft.com>

Co-authored-by: heidi.ou@alibaba-inc.com <heidi.ou@alibaba-inc.com>
Co-authored-by: Ying Xie <ying.xie@microsoft.com>
2020-11-13 13:34:18 -08:00
Joe LeVeque
e0fdf45ad0
[update_chassisdb_config] Convert to Python 3 (#5838)
- Convert update_chassisdb_config script to Python 3
- Reorganize imports per PEP8 standard
- Two blank lines precede functions per PEP8 standard
2020-11-09 08:35:36 -08:00
Joe LeVeque
522a071ffb
[core_cleanup.py] Convert to Python 3; Fix bug; Improve code reuse (#5781)
- Convert to Python 3
- Fix bug: `CORE_FILE_DIR` previously was set to `os.path.basename(__file__)`, which would resolve to the script name. Fix this by hardcoding to `/var/core/` instead
- Remove locally-define logging functions; use Logger class from sonic-py-common instead
2020-11-05 10:01:12 -08:00
judyjoseph
ace7f24cba
[docker-teamd]: Add teamd as a depedent service to swss (#5628)
**- Why I did it**
On teamd docker restart, the swss and syncd needs to be restarted as there are dependent resources present.

**- How I did it**
Add the teamd as a dependent service for swss
Updated the docker-wait script to handle service and dependent services separately.
Handle the case of warm-restart for the dependent service   

**- How to verify it**

Verified the following scenario's with the following testbed 
VM1 ----------------------------[DUT 6100] -----------------------VM2,  ping traffic continuous between VMs

1. Stop teamd docker alone  
      >  swss, syncd dockers seen going away
      >  The LAG reference count error messages seen for a while till swss docker stops.
      >  Dockers back up.

2. Enable WR mode for teamd. Stop teamd docker alone  
      >  swss, syncd dockers not removed.
      >  The LAG reference count error messages not seen
      >  Repeated stop teamd docker test - same result, no effect on swss/syncd.

3. Stop swss docker. 
      >  swss, teamd, syncd goes off - dockers comes back correctly, interfaces up

4. Enable WR mode for swss . Stop swss docker 
      >  swss goes off not affecting syncd/teamd dockers.

5. Config reload 
      > no reference counter error seen, dockers comes back correctly, with interfaces up

6. Warm reboot, observations below
	 > swss docker goes off first 
	 > teamd + syncd goes off to the end of WR process.
 	 > dockers comes back up fine.
	 > ping traffic between VM's was NOT HIT

7. Fast reboot, observations below
	 > teamd goes off first ( **confirmed swss don't exit here** )
	 > swss goes off next 
	 > syncd goes away at the end of the FR process
	 > dockers comes back up fine.
	 > there is a traffic HIT as per fast-reboot

8. Verified in multi-asic platform, the tests above other than WR/FB scenarios
2020-10-23 00:41:16 -07:00
BrynXu
a2e3d2fcea
[ChassisDB]: bring up ChassisDB service (#5283)
bring up chassisdb service on sonic switch according to the design in
Distributed Forwarding in VoQ Arch HLD

Signed-off-by: Honggang Xu <hxu@arista.com>

**- Why I did it**
To bring up new ChassisDB service in sonic as designed in ['Distributed forwarding in a VOQ architecture HLD' ](90c1289eaf/doc/chassis/architecture.md). 

**- How I did it**
Implement the section 2.3.1 Global DB Organization of the VOQ architecture HLD.

**- How to verify it**
ChassisDB service won't start without chassisdb.conf file on the existing platforms.
ChassisDB service is accessible with global.conf file in the distributed arichitecture.

Signed-off-by: Honggang Xu <hxu@arista.com>
2020-10-14 15:15:24 -07:00
anish-n
e15e6a8313
[config-reload]: Add logic to clean up FG_ROUTE state db table during reload (#5518)
Cleanup FG_ROUTE state db table during reload
2020-10-02 09:25:29 -07:00
Syd Logan
0311a4a037
Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature (#4851)
* buildimage: Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature

* scripts and configuration needed to support a second syncd docker (physyncd)
* physyncd supports gearbox device and phy SAI APIs and runs multiple instances of syncd, one per phy in the device
* support for VS target (sonic-sairedis vslib has been extended to support a virtual BCM81724 gearbox PHY).

HLD is located at b817a12fd8/doc/gearbox/gearbox_mgr_design.md

**- Why I did it**

This work is part of the gearbox phy joint effort between Microsoft and Broadcom, and is based
on multi-switch support in sonic-sairedis.

**- How I did it**

Overall feature was implemented across several projects. The collective pull requests (some in late stages of review at this point):

https://github.com/Azure/sonic-utilities/pull/931 - CLI (merged)
https://github.com/Azure/sonic-swss-common/pull/347 - Minor changes (merged)
https://github.com/Azure/sonic-swss/pull/1321 - gearsyncd, config parsers, changes to orchargent to create gearbox phy on supported systems
https://github.com/Azure/sonic-sairedis/pull/624 - physyncd, virtual BCM81724 gearbox phy added to vslib

**- How to verify it**

In a vslib build:

root@sonic:/home/admin# show gearbox interfaces status
  PHY Id    Interface        MAC Lanes    MAC Lane Speed        PHY Lanes    PHY Lane Speed    Line Lanes    Line Lane Speed    Oper    Admin
--------  -----------  ---------------  ----------------  ---------------  ----------------  ------------  -----------------  ------  -------
       1   Ethernet48  121,122,123,124               25G  200,201,202,203               25G       204,205                50G    down     down
       1   Ethernet49  125,126,127,128               25G  206,207,208,209               25G       210,211                50G    down     down
       1   Ethernet50      69,70,71,72               25G  212,213,214,215               25G           216               100G    down     down

In addition, docker ps | grep phy should show a physyncd docker running.

  Signed-off-by: syd.logan@broadcom.com
2020-09-25 08:32:44 -07:00
Tamer Ahmed
2de3afaf35
[swss] Enhance ARP Update to Call Sonic Cfggen Once (#5398)
This PR limited the number of calls to sonic-cfggen to one call
per iteration instead of current 3 calls per iteration.

The PR also installs jq on host for future scripts if needed.

signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2020-09-18 18:44:23 -07:00
Vaibhav Hemant Dixit
9fdbaf0196
BGP Service script path and error fix (#5183)
* BGP service script path update and error fix

Co-authored-by: Vaibhav Hemant Dixit <vadixit@microsoft.com>
2020-08-15 12:09:10 -07:00
Vaibhav Hemant Dixit
b193af2d7f
Make RADV service script executable (#5186)
Co-authored-by: Vaibhav Hemant Dixit <vadixit@microsoft.com>
2020-08-15 12:08:09 -07:00
Vaibhav Hemant Dixit
31d7f27b4b
Move RADV fastboot handling to a service script (#5108)
* New /usr/local/bin/ script to start/stop radv service
2020-08-11 13:13:13 -07:00
Vaibhav Hemant Dixit
ca462d669e
Fix fast-reboot handling for swss script (#5070)
* Fast and warm reboot checks for SWSS start and stop path
2020-08-10 14:48:30 -07:00
lguohan
082c26a27d
[build]: combine feature and container feature table (#5081)
1. remove container feature table
2. do not generate feature entry if the feature is not included
   in the image
3. rename ENABLE_* to INCLUDE_* for better clarity
4. rename feature status to feature state
5. [submodule]: update sonic-utilities

* 9700e45 2020-08-03 | [show/config]: combine feature and container feature cli (#1015) (HEAD, origin/master, origin/HEAD) [lguohan]
* c9d3550 2020-08-03 | [tests]: fix drops_group_test failure on second run (#1023) [lguohan]
* dfaae69 2020-08-03 | [lldpshow]: Fix input device is not a TTY error (#1016) [Arun Saravanan Balachandran]
* 216688e 2020-08-02 | [tests]: rename sonic-utilitie-tests to tests (#1022) [lguohan]

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2020-08-05 13:23:12 -07:00
BrynXu
311045f01f
[vs]: support virtual-chassis setup in vs docker (#4709)
virtual-chassis test uses multiple vs instances to simulate a
modular switch and a redis-chassis service is required to run on
the vs instance that represents a supervisor card.
This change allows vs docker start redis-chassis service according
to external config file.

**- Why I did it**
To support virtual-chassis setup, so that we can test distributed forwarding feature in virtual sonic environment, see `Distributed forwarding in a VOQ architecture HLD` pull request at https://github.com/Azure/SONiC/pull/622

**- How I did it**
The sonic-vs start.sh is enhanced to start new redis_chassis service if external chassis config file found. The config file doesn't exist in current vs environment, start.sh will behave like before. 

**- How to verify it**
The swss/test still pass. The chassis_db service is verified in virtual-chassis topology and tests which are in following PRs.

Signed-off-by: Honggang Xu <hxu@arista.com>
(cherry picked from commit c1d45cf81ce3238be2dcbccae98c0780944981ce)

Co-authored-by: Honggang Xu <hxu@arista.com>
2020-07-29 14:20:31 -07:00
Kebo Liu
f3091c91a6
[Mellanox] remove code which instructs hw-mgmt to skip mlsw_minimal probing in fast-boot flow (#5011) 2020-07-22 12:21:11 +03:00
heidinet2007
de51b9e424
BGP warm reboot script to service (#3992)
* [sonic-buildimage] Move BGP warm reboot scripts into BGP service /usr/local/bin

* Revert "[sonic-buildimage] Move BGP warm reboot scripts into BGP service /usr/local/bin"

This reverts commit d16d163fc4.

* [sonic-buildimage] Move BGP warm reboot script to BGP service

* [sonic-buildimage] Move BGP warm reboot script to BGP service

* [sonic-buildimage] Move BGP warm reboot script to BGP service
- access DB correctly

* Address code review comments, also change file mode of bgp.sh (+x)

* Address code review comments, also change the file mode of bgp.sh (+x)

* BGP warm reboot script to service, also handle fast boot as indicated by flag saved in StateDB

* BGP warm reboot script to service, code review comments on space alignment

* BGP warm reboot script to service: remove uncesseary space

* BGP warm reboot script to service: replace tab with space

* Code review comments: -) use new multi-db api -) add ignore error from zebra in case it's not configured

* Integrate with multi-ASIC changes committed recently

Co-authored-by: heidi.ou@alibaba-inc.com <heidi.ou@alibaba-inc.com>
2020-07-15 11:46:23 -07:00
yozhao101
4fa81b4f8d
[dockers] Update critical_processes file syntax (#4831)
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a  container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.

Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.

**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
2020-06-25 21:18:21 -07:00
Kebo Liu
2b568ec136
Add with_i2cdev for mst start to have I2C device loaded properly (#4790) 2020-06-21 16:27:05 +03:00
yozhao101
4ea2e5e6dc
[docker-syncd] Add timeout to force stop syncd container (#4617)
**- Why I did it**
When I tested auto-restart feature of swss container by manually killing one of critical processes in it, swss will be stopped. Then syncd container as the peer container should also be
stopped as expected. However, I found sometimes syncd container can be stopped, sometimes
it can not be stopped. The reason why syncd container can not be stopped is the process
(/usr/local/bin/syncd.sh stop) to execute the stop() function will be stuck between the lines 164 –167. Systemd will wait for 90 seconds and then kill this process.

164 # wait until syncd quit gracefully
165 while docker top syncd$DEV | grep -q /usr/bin/syncd; do
166 sleep 0.1
167 done

The first thing I did is to profile how long this while loop will spin if syncd container can be
normally stopped after swss container is stopped. The result is 5 seconds or 6 seconds. If syncd
container can be normally stopped, two messages will be written into syslog:

str-a7050-acs-3 NOTICE syncd#dsserve: child /usr/bin/syncd exited status: 134
str-a7050-acs-3 INFO syncd#supervisord: syncd [5] child /usr/bin/syncd exited status: 134

The second thing I did was to add a timer in the condition of while loop to ensure this while loop will be forced to exit after 20 seconds:

After that, the testing result is that syncd container can be normally stopped if swss is stopped
first. One more thing I want to mention is that if syncd container is stopped during 5 seconds or 6 seconds, then the two log messages can be still seen in syslog. However, if the execution 
time of while loop is longer than 20 seconds and is forced to exit, although syncd container can be stopped, I did not see these two messages in syslog. Further, although I observed the auto-restart feature of swss container can work correctly right now, I can not make sure the issue which syncd container can not stopped will occur in future.

**- How I did it**
I added a timer around the while loop in stop() function. This while loop will exit after spinning
20 seconds.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-06-04 15:17:28 -07:00
judyjoseph
acf465b43b
Multi DB with namespace support, Introducing the database_global.json… (#4477)
* Multi DB with namespace support, Introducing the database_global.json file
for supporting accessing DB's in other namespaces for service running in
linux host

* Updates based on comments

* Adding the j2 templates for database_config and database_global files.

* Updating to retrieve the redis DIR's to be mounted from database_global.json file.

* Additional check to see if asic.conf file exists before sourcing it.

* Updates based on PR comments discussion.

* Review comments update

* Updates to the argument "-n" for namespace used in both context of parsing minigraph and multi DB access.

* Update with the attribute "persistence_for_warm_boot" that was added to database_config.json file earlier.

* Removing the database_config.json file to avioid confusion in future.
We use the database_config.json.j2 file to generate database_config.json files dynamically.

* Update the comments for sudo usage in docker_image_ctrl.j2

* Update with the new logic in PING PONG tests using sonic-db-cli. With this we wait till the
PONG response is received when redis server is up.

* Similar changes in swss and syncd scripts for the PING tests with sonic-db-cli

* Updated with a missing , in the database_config.json.j2 file, Do pip install of j2cli in docker-base-buster.
2020-05-08 21:24:05 -07:00
Dong Zhang
340cf826a6
[MultiDB] use sonic-db-cli PING and fix wrong multiDB API in NAT (#4541) 2020-05-06 15:41:28 -07:00
SuvarnaMeenakshi
2a59551eff
[sonic-netns-exec]: use "$@" to reflects all positional parameters as they were set initially (#4375)
sonic-netns-exec fails to execute below command in swss.sh:

    sonic-netns-exec "$NET_NS" sonic-db-cli $1 EVAL "
    local tables = {$2}
    for i = 1, table.getn(tables) do
        local matches = redis.call('KEYS', tables[i])
        for j,name in ipairs(matches) do
            redis.call('DEL', name)
        end
    end" 0

This command fails with error " redis.exceptions.ResponseError: value is not an integer or out of range" .

Root cause:

When sonic-netns-exec executes the above function, argument passed to sonic-db-cli is NOT executed as a single script.

The argument is passed as separate keywords to sonic-db-cli, as below:

['EVAL', 'local', 'tables', '=', "{'PORT_TABLE*'}", 'for', 'i', '=', '1,', 'table.getn(tables)', 'do', 'local', 'matches', '=', "redis.call('KEYS',", 'tables[i])', 'for', 'j,name', 'in', 'ipairs(matches)', 'do', "redis.call('DEL',", 'name)', 'end', 'end', '0']

- How I did it
To make sure that the parameters are passed as they were set initially, fix sonic-netns-exec to use double quoted "$@", where "$@" is "$1" "$2" "$3" ... "${N}"

After fix, the argument passed to sonic-db-cli is as below:

Argument passed to sonic-db-cli:

['EVAL', "\n    local tables = {'PORT_TABLE*'}\n    for i = 1, table.getn(tables) do\n        local matches = redis.call('KEYS', tables[i])\n        for j,name in ipairs(matches) do\n            redis.call('DEL', name)\n        end\n    end", '0']

Signed-off-by: SuvarnaMeenakshi <sumeenak@microsoft.com>
2020-04-07 00:05:47 -07:00
SuvarnaMeenakshi
4b8067e913
Multi-ASIC implementation (#3888)
Changes made to support multi-asic platform. Added multi-instance support for swss, syncd, database, bgp, teamd and lldp.
2020-03-31 10:06:19 -07:00
Joe LeVeque
7c8da20516
[sonic-cfggen] Loading the configuration from init_cfg.json and then from config_db.json (#4148) 2020-03-05 15:35:35 -08:00
Prince Sunny
7ffa2ccb43
Sleep done before mismatch handler (#4165)
* Sleep done before mismatch handler
2020-02-20 12:54:39 -08:00
Prince Sunny
1a0ce9874d
Update arp_update to refresh neighbor entries from APP_DB (#4125) 2020-02-13 10:27:37 -08:00
yozhao101
729f343f77
[Services] Restart database service upon unexpected critical process exit. (#4138)
* [database] Implement the auto-restart feature for database container.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [database] Remove the duplicate dependency in service files. Since we
already have updategraph ---> config_setup ---> database, we do not need
explicitly add database.service in all other container service files.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [event listener] Reorganize the line 73 in event listener script.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [database] update the file sflow.service.j2 to remove the duplicate
dependency.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [event listener] Add comments in event listener.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [event listener] Update the comments in line 56.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

* [event listener] Add parentheses for if statement in line 76 in event listener.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2020-02-11 14:03:02 -08:00
yozhao101
91e5fb5602
[Service] Enable/disable container auto-restart based on configuration. (#4073) 2020-02-07 12:34:07 -08:00
Dong Zhang
7aa0baf709 [MultiDB] (except ./src and ./dockers dirs): replace redis-cli with sonic-db-cli and use new DBConnector (#4035)
* [MultiDB] (except ./src and ./dockers dirs): replace redis-cli with sonic-db-cli and use new DBConnector
* update comment for a potential bug
* update comment
* add TODO maker as review reqirement
2020-01-22 11:26:23 -08:00
lguohan
483a5946a8
Revert "[MultiDB]except src and dockers : replace redis-cli with sonic-db-cli and use new DBConnector (#3928)" (#4002)
This reverts commit 0dae59ac30.
2020-01-10 08:27:34 -08:00
Dong Zhang
0dae59ac30 [MultiDB]except src and dockers : replace redis-cli with sonic-db-cli and use new DBConnector (#3928)
* [MultiDB]except src and dockers : replace redis-cli with sonic-db-cli and use new DBConnector
* fix vs tests along with swss vs tests together
2020-01-02 14:46:25 -08:00
Stepan Blyshchak
b6ad09aa35 [syncd.sh] remove chipdown on mellanox (#3926)
ASIC reset events are captured by hw-mgmt and hw-mgmt calls chipup/chipdown internally without OS iteraction

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-12-23 11:15:08 +02:00
Ying Xie
9baf8f7c33
[swss service] flush fast-reboot enabled flag upon swss stopping (#3908)
If we need to stop swss during fast-reboot procedure on the boot up path,
it means that something went wrong, like syncd/orchagent crashed already,
we are stopping and restarting swss/syncd to re-initialize. In this case,
we should proceed as if it is a cold reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-12-16 07:58:16 -08:00
pavel-shirshov
1848fb262b [fast-reboot]: Save fast-reboot state into the db (#3741)
Put a flag for fast-reboot to the db using EXPIRE feature. Using this flag in other part of SONiC to start in Fast-reboot mode. If we reload a config, the state in the db will be removed.
2019-12-04 14:10:19 -08:00
Ying Xie
fc36ca6e45
Revert "[swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807)" (#3835)
This reverts commit 351410ea8c.
2019-12-02 15:54:55 -08:00
Joe LeVeque
5e6f8adb22 [services] Remove explicit dependencies from dhcp_relay service file, control in swss.sh (#3823) 2019-11-26 16:59:45 -08:00
Joe LeVeque
351410ea8c [swss.sh] When starting, call 'systemctl restart' on dependents, not (#3807)
'systemctl start'
2019-11-22 20:39:09 -08:00
Joe LeVeque
85b0de3df1 [docker-syncd]: Restart SwSS, syncd and dependent services if a critical process in syncd container exits unexpectedly (#3534)
Add the same mechanism I developed for the SwSS service in #2845 to the syncd service. However, in order to cause the SwSS service to also exit and restart in this situation, I developed a docker-wait-any program which the SwSS service uses to wait for either the swss or syncd containers to exit.
2019-11-09 10:26:39 -08:00
yozhao101
ed79f54569 [Services] Restart DHCP-Relay service upon unexpected critical process exit. (#3667)
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2019-11-05 18:32:14 -08:00
Stephen Sun
7308d2eb97 [Mellanox] Stop pmon ahead of syncd (#3505)
Issue Overview
shutdown flow

For any shutdown flow, which means all dockers are stopped in order, pmon docker stops after syncd docker has stopped, causing pmon docker fail to release sx_core resources and leaving sx_core in a bad state. The related logs are like the following:

INFO syncd.sh[23597]: modprobe: FATAL: Module sx_core is in use.
INFO syncd.sh[23597]: Unloading sx_core[FAILED]
INFO syncd.sh[23597]: rmmod: ERROR: Module sx_core is in use
config reload & service swss.restart
In the flows like "config reload" and "service swss restart", the failure cause further consequences:

sx_core initialization error with error message like "sx_core: create EMAD sdq 0 failed. err: -16"
syncd fails to execute the create switch api with error message "syncd_main: Runtime error: :- processEvent: failed to execute api: create, key: SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000, status: SAI_STATUS_FAILURE"
swss fails to call SAI API "SAI_SWITCH_ATTR_INIT_SWITCH", which causes orchagent to restart. This will introduce an extra 1 or 2 minutes for the system to be available, failing related test cases.
reboot, warm-reboot & fast-reboot
In the reboot flows including "reboot", "fast-reboot" and "warm-reboot" this failure doesn't have further negative effects since the system has already rebooted. In addition, "warm-reboot" requires the system to be shutdown as soon as possible to meet the GR time restriction of both BGP and LACP. "fast-reboot" also requires to meet the GR time restriction of BGP which is longer than LACP. In this sense, any unnecessary steps should be avoided. It's better to keep those flows untouched.

summary
To summarize, we have to come up with a way to ensure:

shutdown pmon docker ahead of syncd for "config reload" or "service swss restart" flow;
don't shutdown pmon docker ahead of syncd for "fast-reboot" or "warm-reboot" flow in order to save time.
for "reboot" flow, either order is acceptable.
Solution
To solve the issue, pmon shoud be stopped ahead of syncd stopped for all flows except for the warm-reboot.

- How I did it

To stop pmon ahead of syncd stopped. This is done in /usr/local/bin/syncd.sh::stop() and for all shutdown sequence.
Now pmon stops ahead of syncd so there must be a way in which pmon can start after syncd started. Another point that should be taken consideration is that pmon starting should be deferred so that services which have the logic of graceful restart in fast-reboot and warm-reboot have sufficient CPU cycles to meet their deadline.
This is done by add "syncd.service" as "After" to pmon.service and startin /usr/local/bin/syncd.sh::wait()
To start pmon automatically after syncd started.
2019-09-27 10:15:46 +02:00
Danny Allen
97c675c6d5 [cron.d] Add cron job to periodically clean-up core files (#3449)
* [cron.d] Create cron job to periodically clean-up core files
* Create script to scan /var/core and clean-up older core files
* Create cron job to run clean-up script

Signed-off-by: Danny Allen <daall@microsoft.com>

* Update interval for running cron job

* Respond to feedback

* Change syslog id
2019-09-13 10:50:31 -07:00
pavel-shirshov
8facac9149
[Fast-Reboot]: FR mode is active only first 3 minutes after start. (#3352)
* Fast reboot mode should be enabled only 3 minutes after restart

* Advance sonic-quagga submodule
2019-08-19 16:05:20 -07:00
Ying Xie
84b667fbaf
[radv service] radv service should be a cold only dependent of swss (#3348)
radv should be left alone during warm restart of swss. Otherwise it will
announce departure and cause hosts to lose default gateway.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-08-16 12:08:46 -07:00
Ying Xie
a46df66d05
[service dependent] describe non-warm-reboot dependency outside systemd (#3311)
* [service dependent] describe non-warm-reboot dependency outside systemctl

When dependency was described with systemctl, it will kick in all the time,
including under warm reboot/restart scenarios. This is not what we always
want. For components that are capable of warm reboot/start, they need to
describe dependency in service files.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [service] teamd service should not require swss service

Adding require swss will cause teamd to be killed by systemctl when swss
stops. This is not what we want in warm reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* refactoring code

* rename functions to match other functions in the file
2019-08-08 15:45:17 -07:00
Stepan Blyshchak
59117d23f0 [swss.sh]: Cleanup LAG entries in STATE DB (#3114)
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-07-08 17:29:57 -07:00
Stepan Blyshchak
6961816dec fix fast reboot compatibility (#3083)
* fix fast reboot compatibility

We should handle both cases for backward-compatible with 201803:
 - fast-reboot
 - SONIC_BOOT_TYPE=fast-reboot

* handle review comments
* add a comment that getBootType code snippet is shared between two files
2019-06-26 12:46:58 -07:00
Prince Sunny
231d309b69
Generate interface table to have an entry designated to default VRF. (#2848)
* Generate default VRF table for router interfaces

* Updated jinja2 template to have prefix filter
2019-06-10 14:02:55 -07:00
Nazarii Hnydyn
e041b15d10 [mellanox]: Fixed config reload race. (#2930)
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
2019-05-29 09:57:29 +03:00
Stepan Blyshchak
9523e64666 [swss.sh] flush FDB table during cold start (#2933)
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-05-22 22:07:29 -07:00
Joe LeVeque
6eca27e564 [services] Restart SwSS service upon unexpected critical process exit (#2845)
* [service] Restart SwSS Docker container if orchagent exits unexpectedly

* Configure systemd to stop restarting swss if it attempts to restart more than 3 times in 20 minutes

* Move supervisor-proc-exit-listener script

* [docker-dhcp-relay] Enhance wait_for_intf.sh.j2 to utilize STATEDB

* Ensure dependent services stop/start/restart with SwSS

* Change 'StartLimitInterval' to 'StartLimitIntervalSec', as Stretch installs systemd 232 (>= v230)

* Also update journald.conf options

* Remove 'PartOf' option from unit files

* Add '$(SUPERVISOR_PROC_EXIT_LISTENER_SCRIPT)' to new shared docker-orchagent makefile

* Make supervisor-proc-exit-listener script read from 'critical_processes' file inside container

* Update critical_processes file for swss container
2019-05-01 08:02:38 -07:00
Joe LeVeque
2bb5400948 [services] Services which start containers now use 'docker wait' instead of 'docker attach' (#2661) 2019-03-08 10:59:41 -08:00
Nazarii Hnydyn
b22fe37670 [mellanox]: Upgraded hw-management V.2.0.0160. (#2643)
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
2019-03-06 18:51:46 -08:00
Ying Xie
66f5202b9f
[swss/syncd] cold start syncd service in swss in attach method (#2639)
start() is called by service startPre method, which is blocking. Starting
syncd service here is causing deadlock.

attach() is called by service start method, which is non-blocking.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-03-04 16:46:55 -08:00
Joe LeVeque
5eb7872a07 [services] Ensure swss and syncd services start before dependent services (#2634)
* [services] Ensure swss and syncd services start before dependent services

* Add 'attach' functions to scripts which get installed to /usr/local/bin so that services only reference the one script each

* Add 'After=swss.service' to syncd.service
2019-03-02 15:28:34 -08:00
lguohan
572db1e0a9
[swss]: flush asic db in swss.sh for non warm-boot (#2582)
need to flush asic db in swss.sh instead of syncd.sh

orchagent might already started in swss.sh and put commands
into asic db before asic db is flushed in syncd.sh. This
causes race condition such as INIT_VIEW not passing to syncd.

Signed-off-by: Guohan Lu <gulv@microsoft.com>
2019-02-19 21:48:43 -08:00
Jipan Yang
ff74daaf13 Move warm_restart enable/disable config to stateDB WARM_RESTART_ENABLE_TABLE (#2538)
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
2019-02-19 17:06:56 -08:00
Stepan Blyshchak
2dd769bf46 [syncd.sh] Don't stop sxdkernel during warm shutdown on Mellanox platform (#2572)
/etc/init.d/sxdkernel stop may take up to 15 sec which has impact on
control plane downtime

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-02-15 16:08:08 -08:00
Ying Xie
44551d0fb5
[swss/syncd] log swss/syncd service script activities (#2545)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2019-02-10 11:56:31 -08:00
Prince Sunny
39e12a1d82 [swss]: Change VrfMgrd startup order, cleanup VRF_TABLE from state DB (#2510) 2019-01-31 23:28:31 -08:00
stepanblyschak
ff526dd103 [mellanox|ffb] use system level warm reboot for Mellanox fastfast boot (#2374)
* [mellanox|ffb] use system level warm reboot for Mellanox fastfast boot

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [mellanox|ffb] add comments for mellanox start/stop drivers section

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2019-01-10 14:09:03 -08:00
Volodymyr Samotiy
b506241b84 [syncd]: Fix reload flow for Mellanox platforms (#2386)
* Perform stop/start of Mellanox driver tools for all types of reboot
* Don't set Mellanox FAST_BOOT option for "cold" reboot
* Don't send "syncd_request_shutdown" event for "cold" reboot on Mellanox platforms

Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
2018-12-15 11:36:12 -08:00
Volodymyr Samotiy
75b41233d2 [Mellanox|FFB]: Add support for Mellanox fast-fast boot (#2294)
* [mlnx|ffb] Add support for mellanox fast-fast boot

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

* [mlnx|ffb]: Add support of "config end" event for mlnx fast-fast boot

Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>

* [Mellanox|FFB]: Fix review comments

* Change naming convention from "fast-fast" to "fastfast"

Signed-off-by: Volodymyr Samotiy <volodymyrs@mellanox.com>
2018-12-04 10:11:24 -08:00
Ying Xie
4abbe43463 [syncd] skip ledinit during syncd warm start (#2285)
* [syncd] skip ledinit during syncd warm start

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-11-21 17:56:19 -08:00
Ying Xie
5c8650aaaa [swss service] don't clear WARM_RESTART table (#2256)
Clear WARM_RESTART table could cause component level warm restart to
fail due to missing WARM_RESTART state.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-11-15 22:04:53 -08:00
Ying Xie
8598ccaf84
[syncd] extend syncd service script to support both warm/cold shutdown (#2238)
- cold shutdown is used by regular service stop and/or fast reboot
- warm shutdown is used by warm restart and/or warm reboot

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-11-15 15:47:33 -08:00
stepanblyschak
447ae7b61a [mlnx] Fix fast reboot (#2237)
Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
2018-11-09 21:54:20 -08:00
Shuotian Cheng
110355201b [swss]: Update swss.sh script to clean up specific db when start (#2223)
This script shall not flush all the entries in the state database
when it starts up, since there are entries maintained and written
by other processes outside this docker.

The issue we noticed was that the portchannel states are cleaned
up after teamsyncd writes the entries into the database, which
causes the IPs failed to be configured because intfmgrd considers
the portchannels are not ready yet.

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
2018-11-03 12:32:46 -07:00
Ying Xie
f3ab8cdf9a [warm boot] syncd warm start could be individual warm start (#2147)
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
2018-10-16 11:20:39 -07:00
Kevin(Shengkai) Wang
ea4b4bd650 [mellanox]: Update recipe for hw-mgmt according to latest changes (#2128)
Update the hw-mgmt to latest release V.2.0.0060.
Update the related files according to the latest hw-mgmt.

Signed-off-by: Kevin Wang <kevinw@mellanox.com>
2018-10-08 18:33:44 -07:00
Jipan Yang
dedd5624a0 Adapt to the new WARM_RESTART_TABLE table schema: change from restart… (#2083)
* Adapt to the new WARM_RESTART_TABLE table schema: change from restart_count to restore_count

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Update variable and function name to match restore_count name change

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Update swss submodule for warm restart schema change

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
2018-10-02 06:08:26 -07:00
Ying Xie
c8e6b15504
[syncd] warn shutdown syncd process when warm boot is enabled (#2078)
* [syncd] warn shutdown syncd process when warm boot is enabled

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [warmboot] mount folder to hold warmboot temporary files

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Fix a typo
2018-10-01 19:01:04 -07:00
Ying Xie
cfe01f19e4
Separate syncd service from swss service (#2051)
* [swss.sh] refactor ssh service script code

- Move checks and waits to helper functions.
- Remove early returns from code stream

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [swss.sh] Add debug log for service state changes

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [syncd] Separate out syncd service from swss service

Still make them start/stop/restart synchronously so existing scripts
continue working.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Remove extra 'After' in swss service and remove syncd docker warm boot code

Syncd warm boot needs more thinking, we can put it back once the work
flow has been defined and ready for coding/testing.

* [syncd] syncd start/stop/restart shouldn't affect swss state

Semi-detach syncd service state change from swss:

- swss state change still chase syncd service to follow except warm boot
- syncd state change will only affect itself.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* add missing '{'
2018-09-24 16:35:01 -07:00
Jipan Yang
3f37b96de6 [swss]: Add support for swss docker warm restart (#1982)
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
2018-08-25 01:39:09 -07:00
lguohan
80c6453731
[swss]: simplify swss systemd service file (#1965)
move the swss service start/stop logic into /usr/local/bin/swss.sh

Signed-off-by: Guohan Lu <gulv@microsoft.com>
2018-08-22 13:02:32 -07:00
zhenggen-xu
d761630f73 Fix potential blackholing/looping traffic when link-local was used and refresh ipv6 neighbor to avoid CPU hit (#1904)
* Fix potential blackholing/looping traffic and refresh ipv6 neighbor to avoid CPU hit

In case ipv6 global addresses were configured on L3 interfaces and used for peering,
and routing protocol was using link-local addresses on the same interfaces as prefered nexthops,
the link-local addresses could be aged out after a while due to no activities towards the link-local
addresses themselves. And when we receive new routes with the link-local nexthops, SONiC won't insert
them to the HW, and thus cause looping or blackholing traffic.

Global ipv6 addresses on L3 interfaces between switches are refreshed by BGP keeplive and other messages.

On server facing side, traffic may hit fowarding plane only, and no refresh for the ipv6 neighbor entries regularly.
This could age-out the linux kernel ipv6 neighbor entries, and HW neighbor table entries could be removed,
and thus traffic going to those neighbors would hit CPU, and cause traffic drop and temperary CPU high load.

Also, if link-local addresses were not learned, we may not get them at all later.

It is intended to fix all above issues.

Changes:
Add ndisc6 package in swss docker and use it for ipv6 ndp ping to update the neighbors' state on Vlan interfaces
Change the default ipv6 neighbor reachable timer to 30mins
Add periodical ipv6 multicast ping to ff02::11 to get/refresh link-local neighbor info.

* Fix review comments:
Add PORTCHANNEL_INTERFACE interface for ipv6 multicast ping
format issue

* Combine regular L3 interface and portchannel interface for looping

* Add ndisc6 package to vs docker
2018-08-12 03:14:55 -07:00
pavel-shirshov
c52fb762dd
Convert arp_update into a 'start-it-once' mode (#1864)
* Run arp_update just once, don't restart it. It will run continuosly with 5 min pauses
2018-07-18 13:04:57 -07:00
Joe LeVeque
a36527a6a5
Store ConfigDB init indicator boolean value as 1/0 in Redis to be language-agnostic (#1352) 2018-01-30 15:04:52 -08:00
pavel-shirshov
8cfa223ef9 [scripts]: Fix issues with checking status of the DB. Use one approach everywhere. (#1323) 2018-01-18 19:55:11 -08:00
lguohan
b907e4e9f5
[vs]: add vlan configuration support in virtual switch (#1200) 2017-11-30 14:59:25 -08:00