- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
Avoid sonic-cfggen crashing when a server does not have a configured loopback address in the minigraph
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Server IPv4 loopbacks do not always arrive with /32 prefix, which is a requirement for the MUX_CABLE table in config DB
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
To make the peer switch hostname easily accessible from config DB. Add peer_switch field to DEVICE_METADATA table
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Why I did it
To support FG_ECMP scenarios
- How I did it
Modified minigraph parser to parse ECMP fields in the case they are present in minigraph
- How to verify it
Loaded ensuing config_db file on a DUT to verify the fields are parsed and configure device correctly
Mellanox already supports multiple destination IPs in IPinIP tunnel configuration, thus removing mellanox
exception for IPinIP configuration.
- How I did it
Removed "dst_ip" field generation in mellanox platform condition.
Sorted the "dst_ip" list, so that it is easier to test against sample configuration in unit tests.
Aligned unit test sample.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
frr does not advertise route if local route is not reachable, as a result
loopback route /64 is not advertised to the neighbors. Add static route
allows frr to advertise the route to its peers
Signed-off-by: Guohan Lu <lguohan@gmail.com>
importlib-resources v4.0.0 was released today (2020-12-23) and drops support for Python 2. This caused the sonic-config-engine Python 2 wheel build to fail.
Reference: https://pypi.org/project/importlib-resources/
Pin 'importlib-resources' package to v3.3.1 for Python 2
Unrelated: remove pinned version of zipp for sonic-bgpcfgd because we no longer build a Python 2 version of that package
- Why I did it
Latest master image crashes when loading minigraph
Fixing #6265
- How I did it
Avoid converting 'None' to ipaddress.
- How to verify it
On a system crashing with the issue, manually patch minigraph.py with the change in PR and load minigraph succeeded.
Signed-off-by: Ying Xie ying.xie@microsoft.com
* Parse device type from <ElementType> first in <PngDec>
* Fall back to <Device> type attribute if no <ElementType> is found
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
* Introduced a list console_device_types which contains the device types that support console management feature
* Inject CONSOLE_SWITCH:console_mgmt table with enabled:yes or enabled:no
Signed-off-by: Jing Kan jika@microsoft.com
Submodule updates include the following commits:
* src/sonic-utilities 9dc58ea...f9eb739 (18):
> Remove unnecessary calls to str.encode() now that the package is Python 3; Fix deprecation warning (#1260)
> [generate_dump] Ignoring file/directory not found Errors (#1201)
> Fixed porstat rate and util issues (#1140)
> fix error: interface counters is mismatch after warm-reboot (#1099)
> Remove unnecessary calls to str.decode() now that the package is Python 3 (#1255)
> [acl-loader] Make list sorting compliant with Python 3 (#1257)
> Replace hard-coded fast-reboot with variable. And some typo corrections (#1254)
> [configlet][portconfig] Remove calls to dict.has_key() which is not available in Python 3 (#1247)
> Remove unnecessary conversions to list() and calls to dict.keys() (#1243)
> Clean up LGTM alerts (#1239)
> Add 'requests' as install dependency in setup.py (#1240)
> Convert to Python 3 (#1128)
> Fix mock SonicV2Connector in python3: use decode_responses mode so caller code will be the same as python2 (#1238)
> [tests] Do not trim from PATH if we did not append to it; Clean up/fix shebangs in scripts (#1233)
> Updates to bgp config and show commands with BGP_INTERNAL_NEIGHBOR table (#1224)
> [cli]: NAT show commands newline issue after migrated to Python3 (#1204)
> [doc]: Update Command-Reference.md (#1231)
> Added 'import sys' in feature.py file (#1232)
* src/sonic-py-swsssdk 9d9f0c6...1664be9 (2):
> Fix: no need to decode() after redis client scan, so it will work for both python2 and python3 (#96)
> FieldValueMap `contains`(`in`) will also work when migrated to libswsscommon(C++ with SWIG wrapper) (#94)
- Also fix Python 3-related issues:
- Use integer (floor) division in config_samples.py (sonic-config-engine)
- Replace print statement with print function in eeprom.py plugin for x86_64-kvm_x86_64-r0 platform
- Update all platform plugins to be compatible with both Python 2 and Python 3
- Remove shebangs from plugins files which are not intended to be executable
- Replace tabs with spaces in Python plugin files and fix alignment, because Python 3 is more strict
- Remove trailing whitespace from plugins files
- Why I did it
The l2switch.j2 template does not include all fields for PORT. This could be incompatible with the 201911 image or later.
- How I did it
Update l2switch.j2 template and add a unit test.
Fix 259 alerts reported by the LGTM tool:
- 245 for Unused import
- 7 for Testing equality to None
- 5 for Duplicate key in dict literal
- 1 for Module is imported more than once
- 1 for Unused local variable
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
* Create new `PEER_SWITCH` table in config DB with info from minigraph
* Add `subtype` field to `DEVICE_METADATA` table and set value to `DualToR` if device is in a dual ToR setup
Treat devices that are ToRRouters (ToRRouters and BackEndToRRouters) the same when rendering templates
Except for BackEndToRRouters belonging to a storage cluster, since these devices have extra sub-interfaces created
Treat devices that are LeafRouters (LeafRouters and BackEndLeafRouters) the same when rendering templates
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
When forced mgmt routes are present, the issue fixed as part of #5754 is not complete.
Added a preference(priority) field to forced mgmt route ip rules
Python 3 is more strict with `__slots__`. As per the [documentation](https://docs.python.org/3/reference/datamodel.html#notes-on-using-slots):
> \_\_slots\_\_ are implemented at the class level by creating descriptors (Implementing Descriptors) for each variable name. As a result, class attributes cannot be used to set default values for instance variables defined by \_\_slots\_\_; otherwise, the class attribute would overwrite the descriptor assignment.
This was apparently missed when making sonic-config-engine compliant with Python 3, and errors like the following would be seen:
```
tests/acl_loader_test.py:10: in <module>
from acl_loader.main import *
acl_loader/main.py:8: in <module>
import openconfig_acl
/usr/local/lib/python3.7/dist-packages/openconfig_acl.py:24: in <module>
class yc_state_openconfig_acl__acl_state(PybindBase):
E ValueError: '_pybind_generated_by' in __slots__ conflicts with class variable
```
Take tunnel info from `<TunnelInterface>` tag in minigraph, and create tables in config_DB:
```
"TUNNEL": {
"MUX_TUNNEL_0": {
"tunnel_type": "IPINIP",
"dst_ip": "26.1.1.10",
"dscp_mode": "uniform",
"encap_ecn_mode": "standard",
"ecn_mode": "copy_from_outer",
"ttl_mode": "pipe"
}
}
```
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Fix#5812
LLDP conf Jinja2 Template does not verify IPv4 address and can use IPv6 version. This issue does not effect control LLDP daemon. Issue can be reproduced via `test_snmp_lldp` test. LLDP conf Jinja2 Template selects first item from the list of mgmt interfaces.
TESTBED_1 LLDP conf
```
# cat /etc/lldpd.conf
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern FC00:3::32
configure system hostname dut-1
```
TESTBED_2 LLDP conf
```
# cat /etc/lldpd.conf
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern 10.22.24.61
configure system hostname dut-2
```
TESTBED_1 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|10.22.24.53/23
MGMT_INTERFACE|eth0|FC00:3::32/64
```
TESTBED_2 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|FC00:3::32/64
MGMT_INTERFACE|eth0|10.22.24.61/23
```
Signed-off-by: Petro Bratash <petrox.bratash@intel.com>
FixAzure/SONiC#551
When eth0 IP address is configured, an ip rule is getting added for eth0 IP address through the interfaces.j2 template.
This eth0 ip rule creates an issue when VRF (data VRF or management VRF) is also created in the system.
When any VRF (data VRF or management VRF) is created, a new rule is getting added automatically by kernel as "1000: from all lookup [l3mdev-table]".
This l3mdev IP rule is never getting deleted even if VRF is deleted.
Once if this l3mdev IP rule is added, if user configures IP address for the eth0 interface, interfaces.j2 adds an eth0 IP rule as "1000:from 100.104.47.74 lookup default ". Priority 1000 is automatically chosen by kernel and hence this rule gets higher priority than the already existing rule "1001:from all lookup local ".
This results in an issue "ping from console to eth0 IP does not work once if VRF is created" as explained in Issue 551.
More details and possible solutions are explained as comments in the Issue551.
This PR is to resolve the issue by always fixing the low priority 32765 for the IP rule that is created for the eth0 IP address.
Tested with various combinations of VRF creation, deletion and IP address configuration along with ping from console to eth0 IP address.
Co-authored-by: Kannan KVS <kannan_kvs@dell.com>
* Initial commit for BGP internal neighbor table support.
> Add new template named "internal" for the internal BGP sessions
> Add a new table in database "BGP_INTERNAL_NEIGHBOR"
> The internal BGP sessions will be stored in this new table "BGP_INTERNAL_NEIGHBOR"
* Changes in template generation tests with the introduction of internal neighbor template files.
* Fix for LLDP advertisments being sent with wrong information.
Since lldpd is starting before lldpmgr, some advertisment packets might sent with default value, mac address as Port ID.
This fix hold the packets from being sent by the lldpd until all interfaces are well configured by the lldpmgrd.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
* Fix comments
* Fix unit-test output caused a failure during build
* Add 'run_cmd' function and use it
* Resume lldpd even if port init timeout reached
Find LogicalLinks in minigraph and parse the port information. A new field called `mux_cable` is added to each port's entry in the Port table in config DB:
```
PORT|Ethernet0: {
"alias": "Ethernet4/1"
...
"mux_cable": "true"
}
```
If a mux cable is present on a port, the value for `mux_cable` will be `"true"`. If no mux cable is present, the attribute will either be omitted (default behavior) or set to `"false"`.
Issue was because we were relying on port_alias_asic_map dictionary
but that dictionary can't be used as alias name format has changed.
Fix the port alias mapping as what is needed.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
This PR makes two changes:
- Store Jinja2 cache in LOGLEVEL DB instead of STATE DB
- Store bytecode cache encoded in base64
Tested with the following command: "redis-dump -d 3 -k JINJA2_CACHE"
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
**- Why I did it**
FRR introduced [next hop tracking](http://docs.frrouting.org/projects/dev-guide/en/latest/next-hop-tracking.html) functionality.
That functionality requires resolving BGP neighbors before setting BGP connection (or explicit ebgp-multihop command). Sometimes (BGP MONITORS) our neighbors are not directly connected and sessions are IBGP. In this case current configuration prevents FRR to establish BGP connections. Reason would be "waiting for NHT". To fix that we need either add static routes for each not-directly connected ibgp neighbor, or enable command `ip nht resolve-via-default`
**- How I did it**
Put `ip nht resolve-via-default` into the config
**- How to verify it**
Build an image. Enable BGP_MONITOR entry and check that entry is Established or Connecting in FRR
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
* Install ndppd during image build, and copy config files to image
* Configure proxy settings based on config DB at container start
* Pipe ndppd output to logger inside container to log output in syslog
* Fix generate_l2_config: don't override hostname because sonic-cfggen may not read from Redis. Fix test_l2switch_template test case to test preset l2 feature.
* Improve test script: compare json files with sort_keys
* Revert changes on sample_output
* Remove members field in VLAN section. Fix test assertTrue statement.
Avoiding recursive update of maps as it consumes stack frames. This
PR introduces iterative version of deep_update method.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Rendering templates show different orders when rendered using Python
2 vs Python 3. This PR prepare for Python 3 packaging by creating
new dir 'py2' for Python 2 rendered test cases.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Jinja2 templates rendered using Python 3 interpreter, are required
to conform with Python 3 new semantics.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Natural sorting of SONiC config gen output consumes lot of CPU cycles.
The sole use of natsorted was to make test comparison easier and so,
the natsorting logic is now relocated to the test suite. As a result
sonic-cfggen gained nearly 1 sec per call since we no longer import
natsorted module!
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
As per the VOQ HLDs, internal networking between the linecards and supervisor is required within a chassis.
Allocating 127.X/16 subnets for private communication within a chassis is a good candidate.
It doesn't require any external IP allocation as well as ensure that the traffic will not leave the chassis.
References:
https://github.com/Azure/SONiC/pull/622https://github.com/Azure/SONiC/pull/639
**- How I did it**
Changed the `interfaces.j2` file to add `127.0.0.1/16` as the `lo` ip address.
Then once the interface is up, the post-up command removes the `127.0.0.1/8` ip address.
The order in which the netmask change is made matters for `127.0.0.1` to be reachable at all times.
**- How to verify it**
```
root@sonic:~# ip address show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/16 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
```
Co-authored-by: Baptiste Covolato <baptiste@arista.com>