Why I did it
ndppd by default reads /proc/net/ipv6_route ever 30 seconds. Since T1s advertise so many routes to ToRs, this file is extremely large, and reading it causes ndppd's CPU usage to spike every 30 seconds
How I did it
Increase the delay for reading this file to the maximum possible value (max integer value), which will result in CPU spikes every ~24 days instead of every 30 seconds
How to verify it
Start ndppd with the new config file, confirm that no CPU spikes are seen except at startup
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
The aiohttp package is required by azure.kusto.data which is used by sonic-mgmt/test_reporting.
This change is to ensure that the dependent package is installed in the sonic-mgmt docker.
Signed-off-by: Xin Wang <xiwang5@microsoft.com>
Why I did it
Support readonly version of the command vtysh
How I did it
Check if the command starting with "show", and verify only contains single command in script.
* Why I did it
Upgrade to the latest ixnetwork-restpy and ixnetwork-open-traffic-generator pypi packages
* How I did it
Updated the pip install entries for the packages in the Dockerfile.j2
* How to verify it
pip show ixnetwork-restpy
pip show ixnetwork-open-traffic-generator
Co-authored-by: Neetha John <nejo@microsoft.com>
For the split mode, the config files, like bgpd.conf, zebra.conf and so on, were provided by outside. But the docker_init.sh will overwrite the outside config files if restart bgp service.
How I did it
Add a split mode checking in docker_init.sh, if docker_routing_config_mode is split, don't overwrite the existing routing config files.
How to verify it
Set split mode in config db
{
"DEVICE_METADATA": {
"localhost": {
"hwsku": "Force10-S6000",
"platform": "x86_64-kvm_x86_64-r0",
"docker_routing_config_mode": "split"
...
}
}
}
Replace your bgpd.conf to /etc/sonic/frr/bgpd.conf
Restart bgp service by sudo service bgp restart
The /etc/sonic/frr/bgpd.conf your provided shouldn't be overwritten
Signed-off-by: Ze Gan <ganze718@gmail.com>
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}
Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
This commit has following changes:
* Add templates and code to support VoQ chassis iBGP peers
* Add support to convert a new VoQChassisInternal element in the
BGPSession element of the minigraph to a new BGP_VOQ_CHASSIS_NEIGHBOR
table in CONFIG_DB.
* Add a new set of "voq_chassis" templates to docker-fpm-frr
* Add a new BGP peer manager to bgpcfgd to add neighbors from the
BGP_VOQ_CHASSIS_NEIGHBOR table using the voq_chassis templates.
* Add a test case for minigraph.py, making sure the VoQChassisInternal
element creates a BGP_VOQ_CHASSIS_NEIGHBOR entry, but not if its
value is "false".
* Add a set of test cases for the new voq_chassis templates in
sonic-bgpcfgd tests.
Note that the templates expect the new
"bgp bestpath peer-type multipath-relax" bgpd configuration to be
available.
Signed-off-by: Joanne Mikkelson <jmmikkel@arista.com>
With the latest 201911 image, the following error was seen on staging devices with TSB command ( for both single asic, multi asic ). Though this err message doesn't affect the TSB functionality, it is good to fix.
admin@STG01-0101-0102-01T1:~$ TSB
BGP0 : % Could not find route-map entry TO_TIER0_V4 20
line 1: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 permit 20
% Could not find route-map entry TO_TIER0_V4 30
line 2: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 deny 30
In addition, in this PR I am fixing the message displayed to user when there are no BGP neighbors configured on that BGP instance. In multi-asic device there could be case where there are no BGP neighbors configured on a particular ASIC.
Avoid the following error messages while dynamic buffer calculation is enabled
```
ERR monit[491]: 'swss|buffermgrd' status failed (1) -- '/usr/bin/buffermgrd -l' is not running in host
```
Change /usr/bin/buffermgrd -l to /usr/bin/buffermgrd. The buffermgrd is started by -l for traditional model or -a for dynamic model. So we need to use the common section of both.
Signed-off-by: Stephen Sun <stephens@nvidia.com>
The default bgp connect retry timer is 120 seconds. A reconnection will happen 120 seconds if the initial connection fails. This PR aims to allow a more frequent retry.
Why I did it
It was observed that on a multi-asic DUT bootup, the BGP internal sessions between ASIC's was taking more time to get ESTABLISHED than external BGP sessions. The internal sessions was coming up almost exactly 120 secs later.
In multi-asic platform the bgp dockers ( which is per ASIC ) on switch start are bring brought up around the same time and they try to make the bgp sessions with neighbors (in peer ASIC's) which may be not be completely up. This results in BGP connect fail and the retry happens after 120sec which is the default Connect Retry Timer
How I did it
Add the command to set the bgp neighboring session retry timer to 10sec for internal bgp neighbors.
#### Why I did it
It is possible to have DHCP relay configuration with no servers/
helpers which result in DHCP container to crash. This PR fixes this
issue by not starting DHCP relay for vlans with no DHCP helpers.
resolves: #6931closes: #6931
#### How I did it
Do not add program group for dhcp relay with not dhcp helpers
#### How to verify it
Unit test
1. Made the command next-hop-self force only applicable on back-end asic bgp. This is done so that BGPL iBGP session running on backend can send e-BGP learn nexthop. Back end asic FRR is able to recursively resolve the eBGP nexthop in its routing table since it knows about all the connected routes advertise from front end asic.
2. Made all front-end asic bgp use global loopback ip (Loopback0) as router id and back end asic bgp use Loopbacl4096 as ruter-id and originator id for Route-Reflector. This is done so that routes learnt by external peer do not see Loopback4096 as router id in show ip bgp <route-prerfix> output.
3. To handle above change need to pass Loopback4096 from BGP manager for jinja2 template generation. This was missing and this change/fix is needed for this also https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-fpm-frr/frr/bgpd/templates/dynamic/instance.conf.j2#L27
4. Enhancement to add mult_asic specific bgpd template generation unit test cases.
Enable BBR config allowas-in 1 for internal peers
Why I did:
To advertise BBR routes learnt via e-BGP peer in one asic/namespace to another iBGP asic/namespace via Route Reflector.
adding noTLS mode for debugging purpose
Removing config-set for port 8080. It fails to start telemetry if docker restarts in case on noTLS mode because it expects log_level config to be present as well.
- Introduced TS common file in docker as well and moved common functions.
- TSA/B/C scripts run only in BGP instances for front end ASICs.
In addition skip enforcing it on route maps used between internal BGP sessions.
admin@str--acs-1:~$ sudo /usr/bin/TSA
System Mode: Normal -> Maintenance
and in case of Multi-ASIC
admin@str--acs-1:~$ sudo /usr/bin/TSA
BGP0 : System Mode: Normal -> Maintenance
BGP1 : System Mode: Normal -> Maintenance
BGP2 : System Mode: Normal -> Maintenance
The requirement for zebra to be ready to accept connections is a generic problem that is not
specific to bgpd. Making the script to wait for zebra socket a separate script and let bgpd and
staticd to wait for zebra socket.
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms
Signed-off-by: Guohan Lu <lguohan@gmail.com>
Recent changes brought l2 vlan concept which do not have DHCP
clients behind them and so DHCP relay is not required. Also,
dhcpmon fails to launch on those vlans as their interfaces
lack IP addresses. This PR limit launch of both DHCP relay
and dhcpmon to L3 vlans only.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
- Support for non-template based FRR configurations (BGP, route-map, OSPF, static route..etc) using config DB schema.
- Support for save & restore - Jinja template based config-DB data read and apply to FRR during startup
**- How I did it**
- add frrcfgd service
- when frr_mgmg_framework_config is set, frrcfgd starts in bgp container
- when user changed the BGP or other related table entries in config DB, frrcfgd will run corresponding VTYSH commands to program on FRR.
- add jinja template to generate FRR config file to be used by FRR daemons while bgp container restarted
**- How to verify it**
1. Add/delete data on config DB and then run VTYSH "show running-config" command to check if FRR configuration changed.
1. Restart bgp container and check if generated FRR config file is correct and run VTYSH "show running-config" command to check if FRR configuration is consistent with attributes in config DB
Co-authored-by: Zhenhong Zhao <zhenhong.zhao@dell.com>
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com
- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well
- How I did it
Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec
Update the logrotate script for multiasic platform .
**- Why I did it**
Ledd is the last daemon that is not enabled to run in python3.
Even though there is a plan to deprecate this daemon and to replace it by something else it's one simple step toward python2 deprecation.
**- How I did it**
Changed the `command=` line for `ledd` in the `supervisord` configuration of `pmon`.
Copied what was done for other daemons.
**- How to verify it**
Booting a product that has a `led_control.py` should now show the ledd running in python3.
I ran `python3 -m pylint` on all `led_control.py` plugin which means that most of them should be python3 compliant.
There is however still a risk that some might not work.
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
Fix#5026
There is a race condition between zebra server accepts connections and bgpd tries to connect. Bgpd has a chance to try to connect before zebra is ready. In this scenario, bgpd will try again after 10 seconds and operate as normal within these 10 seconds. As a consequence, whatever bgpd tries to sent to zebra will be missing in the 10 seconds. To avoid such a scenario, bgpd should start after zebra is ready to accept connections.
**- Why I did it**
Earlier today we found a bug in the SONiC TSA implementation.
TSC shows incorrect output (see below) in case we have a route-map which contains TSA route-map as a prefix.
```
admin@str-s6100-acs-1:~$ TSC
Traffic Shift Check:
System Mode: Not consistent
```
The reason is that TSC implementation has too loose regexps in TSA utilities, which match wrong route-map entries:
For example, current TSC matches following
```
route-map TO_BGP_PEER_V4 permit 200
route-map TO_BGP_PEER_V6 permit 200
```
But it should match only
```
route-map TO_BGP_PEER_V4 permit 20
route-map TO_BGP_PEER_V4 deny 30
route-map TO_BGP_PEER_V6 permit 20
route-map TO_BGP_PEER_V6 deny 30
```
**- How I did it**
I fixed it by using egrep with `^` and `$` regexp markers which match begin and end of the line.
**- How to verify it**
1. Add follwing entry to FRR config:
```
str-s6100-acs-1#
str-s6100-acs-1# conf t
str-s6100-acs-1(config)# route-map TO_BGP_PEER_V4 permit 200
str-s6100-acs-1(config-route-map)# end
```
2. Use the TSC command and check output. It should show normal.
```
admin@str-s6100-acs-1:~$ TSC
Traffic Shift Check:
System Mode: Normal```
fix platform driver breakage due to python3 upgrade and fix load minigraph errors with config load_minigraph -y
**- How I did it**
added python3-smbus to the pmon docker template since the previous was python2 specific
fixed additional "ord" python2 specific code
fixed the jinja templates used by qos reload - the template logic required data to be parsed
**- How to verify it**
run "show platform XXX" commands and verify output
run "sudo config load_minigraph -y" and verify configuration
run "show interfaces XXX" and verify output
Co-authored-by: Carl Keene <keene@nokia.com>
The HLD about MACsec feature is at :
https://github.com/Azure/SONiC/blob/master/doc/macsec/MACsec_hld.md
- How to verify it
This PR doesn't set MACsec container automatically start, You should manually start the container by docker run docker-macsec
wpa_supplicant binary can be found at MACsec container.
This PR depends on the PR, WPA_SUPPLICANT, and The MACsec container will be set as automatically start by later PR.
Signed-off-by: zegan <zegan@microsoft.com>
* Use 20 and 30 route-map entries instead of 2 and 3 for TSA
* Added support for dynamic "Allow list" default action.
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
- Make PDDF code compliant with both Python 2 and Python 3
- Align code with PEP8 standards using autopep8
- Build and install both Python 2 and Python 3 PDDF packages
Enable the notify mode of rsyslogd imfile module used for supervisord
logs in docker container.
Setup the mode="inotify" when loading imfile, made sure we are are getting
supervisord logs in host immediately.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
**- Why I did it**
I'm updating the jinja2 template to support getting SNMP information from the redis configdb.
I'm using the format approved here:
https://github.com/Azure/SONiC/pull/718
This will pave the way for us to decrement using the snmp.yml in the future.
Right now we will still be using both the snmp.yml and configdb to get variable information in order to create the snmpd.conf via the sonic-cfggen tool.
**- How I did it**
I first updated the SNMP Schema in PR #718 to get that approved as a standardized format.
Then I verified I could add snmp configs to the configdb using this standard schema. Once the configs were added to the configdb then I updated the snmpd.conf.j2 file to support the updates via the configdb while still using the variables in the snmp.yml file in parallel. This way we will have backward compatibility until we can fully migrate to the configdb only.
By updating the snmpd.conf.j2 template and running the sonic-cfggen tool the snmpd.conf gets generated with using the values in both the configdb and snmp.yml file.
Co-authored-by: trvanduy <trvanduy@microsoft.com>
Mellanox already supports multiple destination IPs in IPinIP tunnel configuration, thus removing mellanox
exception for IPinIP configuration.
- How I did it
Removed "dst_ip" field generation in mellanox platform condition.
Sorted the "dst_ip" list, so that it is easier to test against sample configuration in unit tests.
Aligned unit test sample.
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
This PR is in preparation to move from snmp.yml to configdb. This will more closely align with other commands in sonic and use configdb as the source of truth for snmp configuration.
Note: This is the first of 2 PR's to enable this. This PR will not change any functionality but will allow the snmp.yml file info to be put into the configdb.
Created a script that takes the snmp.yml variables and converts them to the configdb format.
Added file to dockerfile.j2 so that file is copied in the container.
Updated start.sh file to automatically run the python conversion script each time the docker container is restarted.
frr does not advertise route if local route is not reachable, as a result
loopback route /64 is not advertised to the neighbors. Add static route
allows frr to advertise the route to its peers
Signed-off-by: Guohan Lu <lguohan@gmail.com>
**- Why I did it**
python2 is end of life and SONiC is going to support python3. This PR is going to support:
1. Build pmon daemons with python3
2. Install and run python3 version pmon daemons
**- How I did it**
1. Change pmon daemons make files to build bothe python2 and python3 whl
2. Change docker-platform-monitor make files to install both python2 and python3 whl
3. Change pmon docker startup files to start pmon daemons according to the supported platform API version
Introduce tunnel manager daemon. Start the process as part of swss container
Submodule update for swss:
9ed3026 - 2020-12-24 : [NAT] ACL Rule with DO_NOT_NAT action is getting failed. (#1502) [Akhilesh Samineni]
c39a4b1 - 2020-12-23 : Mux/IPTunnel orchagent changes (#1497) [Prince Sunny]
bc8df0e - 2020-12-23 : Add support for headroom pool watermark (#1567) [Neetha John]
**- Why I did it**
As part of migrating SONiC codebase from Python 2 to Python 3
**- How I did it**
- No longer install Python 2 in docker-base-buster or docker-config-engine-buster.
- Install Python 2 and pip2 in the following containers until we can completely eliminate it there:
- docker-platform-monitor
- docker-sonic-mgmt-framework
- docker-sonic-vs
- Pin pip2 version <21 where it is still temporarily needed, as pip version 21 will drop support for Python 2
- Also preform some other cleanup, ensuring that pip3, setuptools and wheel packages are installed in docker-base-buster, and then removing any attempts to re-install them in derived containers
* First cut image update for kubernetes support.
With this,
1) dockers dhcp_relay, lldp, pmon, radv, snmp, telemetry are enabled
for kube management
init_cfg.json configure set_owner as kube for these
2) Each docker's start.sh updated to call container_startup.py to register going up
As part of this call, it registers the current owner as local/kube and its version
The images are built with its version ingrained into image during build
3) Update all docker's bash script to call 'container start/stop/wait' instead of 'docker start/stop/wait'.
For all locally managed containers, it calls docker commands, hence no change for locally managed.
4) Introduced a new ctrmgrd service, that helps with transition between owners as kube & local and carry over any labels update from STATE-DB to API server
5) hostcfgd updated to handle owner change
6) Reboot scripts are updatd to tag kube running images as local, so upon reboot they run the same image.
7) Added kube_commands.py to handle all updates with Kubernetes API serrver -- dedicated for k8s interaction only.
HLD: Azure/SONiC#646
In modular chassis, add CHASSIS_STATE_DB on control card
Why I did it
Modular Chassis has control-cards, line-cards and fabric-cards along with other peripherals. Control-Card CHASSIS_STATE_DB will be the central DB to maintain any state information of cards that is accessible to control-card/
How I did it
Adding another DB on an existing REDIS instance running on port 6380.
HLD: Azure/SONiC#646
Introducing chassisd process to monitor status of the control, line and fabric cards in a modular chassis.
- Why I did it
Modular Chassis has control-cards, line-cards and fabric-cards along with other peripherals. Chassisd will be a central entity that has visibility of the entire chassis.
- How I did it
Chassisd process will monitor cards in the main thread. Another configuation_handling_task is created to listen to CONFIG_DB for admin_status up/down events. The monitored status is persisted in REDIS-DB.
Install the necessary python3 dependent packages to convert restore_neighbor.py
to support python3 as python2 is EOL. See: Azure/sonic-swss#1542
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
libxslt-dev and libz-dev are dependencies for lxml==4.6.1 which is required for pyangbind==0.8.1
lxml-4.6.2-cp37-cp37m-manylinux1_x86_64.whl is directly downloaded in amd64 whereas in arm this is built from lxml-4.6.2.tar.gz
Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>
**- Why I did it**
To support dynamic buffer calculation.
This PR also depends on the following PRs for sub modules
- [sonic-swss: [buffermgr/bufferorch] Support dynamic buffer calculation #1338](https://github.com/Azure/sonic-swss/pull/1338)
- [sonic-swss-common: Dynamic buffer calculation #361](https://github.com/Azure/sonic-swss-common/pull/361)
- [sonic-utilities: Support dynamic buffer calculation #973](https://github.com/Azure/sonic-utilities/pull/973)
**- How I did it**
1. Introduce field `buffer_model` in `DEVICE_METADATA|localhost` to represent which buffer model is running in the system currently:
- `dynamic` for the dynamic buffer calculation model
- `traditional` for the traditional model in which the `pg_profile_lookup.ini` is used
2. Add the tables required for the feature:
- ASIC_TABLE in platform/\<vendor\>/asic_table.j2
- PERIPHERAL_TABLE in platform/\<vendor\>/peripheral_table.j2
- PORT_PERIPHERAL_TABLE on a per-platform basis in device/\<vendor\>/\<platform\>/port_peripheral_config.j2 for each platform with gearbox installed.
- DEFAULT_LOSSLESS_BUFFER_PARAMETER and LOSSLESS_TRAFFIC_PATTERN in files/build_templates/buffers_config.j2
- Add lossless PGs (3-4) for each port in files/build_templates/buffers_config.j2
3. Copy the newly introduced j2 files into the image and rendering them when the system starts
4. Update the CLI options for buffermgrd so that it can start with dynamic mode
5. Fetches the ASIC vendor name in orchagent:
- fetch the vendor name when creates the docker and pass it as a docker environment variable
- `buffermgrd` can use this passed-in variable
6. Clear buffer related tables from STATE_DB when swss docker starts
7. Update the src/sonic-config-engine/tests/sample_output/buffers-dell6100.json according to the buffer_config.j2
8. Remove buffer pool sizes for ingress pools and egress_lossy_pool
Update the buffer settings for dynamic buffer calculation
* restoring each database with all data before warmboot and then flush unused data in each instance, following the multiDB warmboot design at https://github.com/Azure/SONiC/blob/master/doc/database/multi_database_instances.md
* restore needs to be done in database docker since we need to know the database_config.json in new version
* copy all data rdb file into each instance restoration location andthen flush unused database
* other logic is the same as before
* backing up database part is in another PR at sonic-utilities https://github.com/Azure/sonic-utilities/pull/1205, they depend on each other
The service crash when the platform boots due to missing waits.
/usr/bin/database.sh tries to operate on a missing socket and fails.
We now wait for the chassis database to be ready the same way we do database.
Barefoot platform vendors' sonic_platform packages import the Python 'thrift' library. Previously, our custom-built package was being installed in the PMon container and host OS. However, we are only building a Python 2 version of that package, which was only intended for use with saithrift.
Fixes#6077
**- Why I did it**
Align style with slightly modified PEP8 standards (extend maximum line length to 120 chars). This will also help in the transition to Python 3, where it is more strict about whitespace, plus it helps unify style among the SONiC codebase. Will tackle other directories in separate PRs.
**- How I did it**
Using `autopep8 --in-place --max-line-length 120` and some manual tweaks.
Made changes so that Lldp docker start using py3 of sonic-db-syncd
submodule update sonic-db-syncd
5cc29a1b32d8d1f4dfbc967bfea2727c50a49c76 (HEAD -> master, origin/master, origin/HEAD) Changes to convert sonic-dbsyncd from python 2 to 3
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
Fixed TSA bugs:
1. TSA didn't advertise Loopback ipv6 address
2. TSA and TSB changed BGP dynamic and BGP monitors sessions
**- How to verify it**
Build an image and run on your DUT.
```
admin@str-s6100-acs-1:~$ TSA
System Mode: Normal -> Maintenance
admin@str-s6100-acs-1:~$ vtysh -c 'show bgp ipv4 neighbors 10.0.0.1 advertised-routes'
BGP table version is 6, local router ID is 10.1.0.32, vrf id 0
Default local pref 100, local AS 64601
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 10.1.0.32/32 0.0.0.0 0 32768 i
Total number of prefixes 1
admin@str-s6100-acs-1:~$ vtysh -c 'show bgp ipv6 neighbors fc00::a advertised-routes'
BGP table version is 6, local router ID is 10.1.0.32, vrf id 0
Default local pref 100, local AS 64601
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> fc00:1::/64 :: 0 32768 i
Total number of prefixes 1
admin@str-s6100-acs-1:~$ TSB
System Mode: Maintenance -> Normal
```
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
This change introduces PDDF which is described here: https://github.com/Azure/SONiC/pull/536
Most of the platform bring up effort goes in developing the platform device drivers, SONiC platform APIs and validating them. Typically each platform vendor writes their own drivers and platform APIs which is very tailor made to that platform. This involves writing code, building, installing it on the target platform devices and testing. Many of the details of the platform are hard coded into these drivers, from the HW spec. They go through this cycle repetitively till everything works fine, and is validated before upstreaming the code.
PDDF aims to make this platform driver and platform APIs development process much simpler by providing a data driven development framework. This is enabled by:
JSON descriptor files for platform data
Generic data-driven drivers for various devices
Generic SONiC platform APIs
Vendor specific extensions for customisation and extensibility
Signed-off-by: Fuzail Khan <fuzail.khan@broadcom.com>
1. Update SSL ca certificates for secure download [arm specific]
2. Using redis-tools from blob sonic-storage for docker-base-stretch
Signed-off-by: Sabareesh Kumar Anandan <sanandan@marvell.com>
Treat devices that are ToRRouters (ToRRouters and BackEndToRRouters) the same when rendering templates
Except for BackEndToRRouters belonging to a storage cluster, since these devices have extra sub-interfaces created
Treat devices that are LeafRouters (LeafRouters and BackEndLeafRouters) the same when rendering templates
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
*Why I did it
Provide access to the ixnetwork-open-traffic-generator pypi package
*How I did it
Added a pip install entry for the package in the Dockerfile.j2
*How to verify it
pip show ixnetwork-open-traffic-generator
*Description for the changelog
Install of ixnetwork-open-traffic-generator pypi package is required for proposed rdma tests.
- Why I did it
Update the routine is_bgp_session_internal() by checking the BGP_INTERNAL_NEIGHBOR table.
Additionally to address the review comment #5520 (comment)
Add timer settings as will in the internal session templates and keep it minimal as these sessions which will always be up.
Updates to the internal tests data + add all of it to template tests.
- How I did it
Updated the APIs and the template files.
- How to verify it
Verified the internal BGP sessions are displayed correctly with show commands with this API is_bgp_session_internal()
lldpmgrd, bgpcfgd, and bgpmon are reported error status not running due
to recent change of shebang to use `Python3`. Modifying the argument of
`process_checker` to follow this change.
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
* Fix some spelling error and add addition check in port_index_mapper.py script.
* Remove unneeded pyroute2 python module from sFlow container
Signed-off-by: Garrick He <garrick_he@dell.com>
Fix#5812
LLDP conf Jinja2 Template does not verify IPv4 address and can use IPv6 version. This issue does not effect control LLDP daemon. Issue can be reproduced via `test_snmp_lldp` test. LLDP conf Jinja2 Template selects first item from the list of mgmt interfaces.
TESTBED_1 LLDP conf
```
# cat /etc/lldpd.conf
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern FC00:3::32
configure system hostname dut-1
```
TESTBED_2 LLDP conf
```
# cat /etc/lldpd.conf
configure ports eth0 lldp portidsubtype local eth0
configure system ip management pattern 10.22.24.61
configure system hostname dut-2
```
TESTBED_1 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|10.22.24.53/23
MGMT_INTERFACE|eth0|FC00:3::32/64
```
TESTBED_2 MGMT_INTERFACE
```
$ redis-cli -n 4 keys "*" | grep MGMT_INTERFACE
MGMT_INTERFACE|eth0|FC00:3::32/64
MGMT_INTERFACE|eth0|10.22.24.61/23
```
Signed-off-by: Petro Bratash <petrox.bratash@intel.com>
- Convert restore_nat_entries.py script to Python 3
- Use logger from sonic-py-common for uniform logging
- Reorganize imports alphabetically per PEP8 standard
- Two blank lines precede functions per PEP8 standard
**- Why I did it**
A memory issue was discovered during system test for scaling. The issue is documented here: https://docs.pyroute2.org/ipdb.html
> One of the major issues with IPDB is its memory footprint. It proved not to be suitable for environments with thousands of routes or neighbours. Being a design issue, it could not be fixed, so a new module was started, NDB, that aims to replace IPDB. IPDB is still more feature rich, but NDB is already more fast and stable.
**- How I did it**
- Rewrote the port_index_mapper.py script to use dB events.
- Convert to Python 3
**- Why I did it**
As part of moving all SONiC code from Python 2 (no longer supported) to Python 3
**- How I did it**
- Convert enable_counters.py script to Python 3
- Reorganize imports per PEP8 standard
- Two blank lines precede functions per PEP8 standard
* Convert bgpcfgd to python3
Convert bgpmon to python3
Fix some issues in bgpmon
* Add python3-swsscommon as depends
* Install dependencies
* reorder deps
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
Recent changes to dependencies caused the 'enum34' package to cease being installed for Python 2 in the PMon container. This broke Arista platforms, where the Arista sonic_platform package imports 'enum'. This is because on Arista devices, the sonic_platform wheel is not installed in the container. Instead, the installation directory is mounted from the host OS. However, this method doesn't ensure all dependencies are installed in the container.
Prevent intermittent build failures when building Sonic for the ARM platform architecture due to version upgrades of the redis-tools and redis-server packages.
Modify select Dockerfile templates to download the redis-tools and redis-server packages from sonicstorage rather than from debian.org.
This PR has been made possible by the inclusion of ARM versions of redis-tools and redis-server into sonicstorage as described in Issue# 5701
* This was a temporary fix for orchagent spamming log messages and causing rate limiting, leading to critical messages being dropped for the syslog. No longer needed since Azure/sonic-sairedis#680 was merged.
On Arista platforms, sonic_platform packages are not installed in the PMon container, but are rather mounted into the container from the host OS. Therefore, pip show sonic_platform will fail in the PMon container. This change will first check if we can import sonic_platform. If this fails, it will then fall back to checking if the package is installed. If both fail, it will attempt to install the package.
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
Increase startretires value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
As part of the transition from Python 2 to Python 3, we are installing both pip2 and pip3 in the slave and config-engine containers. This PR replaces calls to `pip` in these containers with an explicit call to `pip2` to ensure the proper version of pip is executed, no matter which version of pip is aliased to `pip`, as we no longer rely on that alias.
Also some other pip-related cleanup
Upgrading pip3 after pip2 caused the pip command to be aliased to the pip3 command. However, since we are still transitioning from Python 2 to Python 3, most pip commands in the codebase are expecting pip to alias to pip2. The proper solution here is to explicitly call pip2 and pip3, and no longer call pip, however this will require extensive changes and testing, so to quickly fix this issue, we upgraded pip2 after pip3 to ensure that pip2 is installed after pip3.
* Initial commit for BGP internal neighbor table support.
> Add new template named "internal" for the internal BGP sessions
> Add a new table in database "BGP_INTERNAL_NEIGHBOR"
> The internal BGP sessions will be stored in this new table "BGP_INTERNAL_NEIGHBOR"
* Changes in template generation tests with the introduction of internal neighbor template files.
* Fix for LLDP advertisments being sent with wrong information.
Since lldpd is starting before lldpmgr, some advertisment packets might sent with default value, mac address as Port ID.
This fix hold the packets from being sent by the lldpd until all interfaces are well configured by the lldpmgrd.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
* Fix comments
* Fix unit-test output caused a failure during build
* Add 'run_cmd' function and use it
* Resume lldpd even if port init timeout reached
The orchagent and syncd need to have the same default synchronous mode configuration. This PR adds a template file to translate the default value in CONFIG_DB (empty field) to an explicit mode so that the orchagent and syncd could have the same default mode.
**- Why I did it**
To introduce dynamic support of BBR functionality into bgpcfgd.
BBR is adding `neighbor PEER_GROUP allowas-in 1' for all BGP peer-groups which points to T0
Now we can add and remove this configuration based on CONFIG_DB entry
**- How I did it**
I introduced a new CONFIG_DB entry:
- table name: "BGP_BBR"
- key value: "all". Currently only "all" is supported, which means that all peer-groups which points to T0s will be updated
- data value: a dictionary: {"status": "status_value"}, where status_value could be either "enabled" or "disabled"
Initially, when bgpcfgd starts, it reads initial BBR status values from the [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR34). Then you can control BBR status by changing "BGP_BBR" table in the CONFIG_DB (see examples below).
bgpcfgd knows what peer-groups to change fron [constants.yml](https://github.com/Azure/sonic-buildimage/pull/5626/files#diff-e6f2fe13a6c276dc2f3b27a5bef79886f9c103194be4fcb28ce57375edf2c23cR39). The dictionary contains peer-group names as keys, and a list of address-families as values. So when bgpcfgd got a request to change the BBR state, it changes the state only for peer-groups listed in the constants.yml dictionary (and only for address families from the peer-group value).
**- How to verify it**
Initially, when we start SONiC FRR has BBR enabled for PEER_V4 and PEER_V6:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
neighbor PEER_V4 allowas-in 1
neighbor PEER_V6 allowas-in 1
```
Then we apply following configuration to the db:
```
admin@str-s6100-acs-1:~$ cat disable.json
{
"BGP_BBR": {
"all": {
"status": "disabled"
}
}
}
admin@str-s6100-acs-1:~$ sonic-cfggen -j disable.json -w
```
The log output are:
```
Oct 14 18:40:22.450322 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'disabled'),))'
Oct 14 18:40:22.450620 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpmWTiuq']'.
Oct 14 18:40:22.681084 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:22.904626 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```
Check FRR configuraiton and see that no allowas parameters are there:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
admin@str-s6100-acs-1:~$
```
Then we apply enabling configuration back:
```
admin@str-s6100-acs-1:~$ cat enable.json
{
"BGP_BBR": {
"all": {
"status": "enabled"
}
}
}
admin@str-s6100-acs-1:~$ sonic-cfggen -j enable.json -w
```
The log output:
```
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: Received message : '('all', 'SET', (('status', 'enabled'),))'
Oct 14 18:40:41.074720 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-f', '/tmp/tmpDD6SKv']'.
Oct 14 18:40:41.587257 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V4 soft in']'.
Oct 14 18:40:42.042967 str-s6100-acs-1 DEBUG bgp#bgpcfgd: execute command '['vtysh', '-c', 'clear bgp peer-group PEER_V6 soft in']'.
```
Check FRR configuraiton and see that the BBR configuration is back:
```
admin@str-s6100-acs-1:~$ vtysh -c 'show run' | egrep 'PEER_V.? allowas'
neighbor PEER_V4 allowas-in 1
neighbor PEER_V6 allowas-in 1
```
*** The test coverage ***
Below is the test coverage
```
---------- coverage: platform linux2, python 2.7.12-final-0 ----------
Name Stmts Miss Cover
----------------------------------------------------
bgpcfgd/__init__.py 0 0 100%
bgpcfgd/__main__.py 3 3 0%
bgpcfgd/config.py 78 41 47%
bgpcfgd/directory.py 63 34 46%
bgpcfgd/log.py 15 3 80%
bgpcfgd/main.py 51 51 0%
bgpcfgd/manager.py 41 23 44%
bgpcfgd/managers_allow_list.py 385 21 95%
bgpcfgd/managers_bbr.py 76 0 100%
bgpcfgd/managers_bgp.py 193 193 0%
bgpcfgd/managers_db.py 9 9 0%
bgpcfgd/managers_intf.py 33 33 0%
bgpcfgd/managers_setsrc.py 45 45 0%
bgpcfgd/runner.py 39 39 0%
bgpcfgd/template.py 64 11 83%
bgpcfgd/utils.py 32 24 25%
bgpcfgd/vars.py 1 0 100%
----------------------------------------------------
TOTAL 1128 530 53%
```
**- Which release branch to backport (provide reason below if selected)**
- [ ] 201811
- [x] 201911
- [x] 202006
use correct chassisdb.conf path while bringing up chassis_db service on VoQ modular switch.chassis_db service on VoQ modular switch.
resolves#5631
Signed-off-by: Honggang Xu <hxu@arista.com>
There is currently a bug where messages from swss with priority lower than the current log level are still being counted against the syslog rate limiting threshhold. This leads to rate-limiting in syslog when the rate-limiting conditions have not been met, which causes several sonic-mgmt tests to fail since they are dependent on LogAnalyzer. It also omits potentially useful information from the syslog. Only rate-limiting messages of level INFO and lower allows these tests to pass successfully.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
bring up chassisdb service on sonic switch according to the design in
Distributed Forwarding in VoQ Arch HLD
Signed-off-by: Honggang Xu <hxu@arista.com>
**- Why I did it**
To bring up new ChassisDB service in sonic as designed in ['Distributed forwarding in a VOQ architecture HLD' ](90c1289eaf/doc/chassis/architecture.md).
**- How I did it**
Implement the section 2.3.1 Global DB Organization of the VOQ architecture HLD.
**- How to verify it**
ChassisDB service won't start without chassisdb.conf file on the existing platforms.
ChassisDB service is accessible with global.conf file in the distributed arichitecture.
Signed-off-by: Honggang Xu <hxu@arista.com>
We were building our own python-click package because we needed features/bug fixes available as of version 7.0.0, but the most recent version available from Debian was in the 6.x range.
"Click" is needed for building/testing and installing sonic-utilities. Now that we are building sonic-utilities as a wheel, with Click specified as a dependency in the setup.py file, setuptools will install a more recent version of Click in the sonic-slave-buster container when building the package, and pip will install a more recent version of Click in the host OS of SONiC when installing the sonic-utilities package. Also, we don't need to worry about installing the Python 2 or 3 version of the package, as the proper one will be installed as necessary.
**- Why I did it**
FRR introduced [next hop tracking](http://docs.frrouting.org/projects/dev-guide/en/latest/next-hop-tracking.html) functionality.
That functionality requires resolving BGP neighbors before setting BGP connection (or explicit ebgp-multihop command). Sometimes (BGP MONITORS) our neighbors are not directly connected and sessions are IBGP. In this case current configuration prevents FRR to establish BGP connections. Reason would be "waiting for NHT". To fix that we need either add static routes for each not-directly connected ibgp neighbor, or enable command `ip nht resolve-via-default`
**- How I did it**
Put `ip nht resolve-via-default` into the config
**- How to verify it**
Build an image. Enable BGP_MONITOR entry and check that entry is Established or Connecting in FRR
Co-authored-by: Pavel Shirshov <pavel.contrib@gmail.com>
* Install ndppd during image build, and copy config files to image
* Configure proxy settings based on config DB at container start
* Pipe ndppd output to logger inside container to log output in syslog
implements a new feature: "BGP Allow list."
This feature allows us to control which IP prefixes are going to be advertised via ebgp from the routes received from EBGP neighbors.
Jinja2 templates rendered using Python 3 interpreter, are required
to conform with Python 3 new semantics.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
implements a new feature: "BGP Allow list."
This feature allows us to control which IP prefixes are going to be advertised via ebgp from the routes received from EBGP neighbors.
* buildimage: Add gearbox phy device files and a new physyncd docker to support VS gearbox phy feature
* scripts and configuration needed to support a second syncd docker (physyncd)
* physyncd supports gearbox device and phy SAI APIs and runs multiple instances of syncd, one per phy in the device
* support for VS target (sonic-sairedis vslib has been extended to support a virtual BCM81724 gearbox PHY).
HLD is located at b817a12fd8/doc/gearbox/gearbox_mgr_design.md
**- Why I did it**
This work is part of the gearbox phy joint effort between Microsoft and Broadcom, and is based
on multi-switch support in sonic-sairedis.
**- How I did it**
Overall feature was implemented across several projects. The collective pull requests (some in late stages of review at this point):
https://github.com/Azure/sonic-utilities/pull/931 - CLI (merged)
https://github.com/Azure/sonic-swss-common/pull/347 - Minor changes (merged)
https://github.com/Azure/sonic-swss/pull/1321 - gearsyncd, config parsers, changes to orchargent to create gearbox phy on supported systems
https://github.com/Azure/sonic-sairedis/pull/624 - physyncd, virtual BCM81724 gearbox phy added to vslib
**- How to verify it**
In a vslib build:
root@sonic:/home/admin# show gearbox interfaces status
PHY Id Interface MAC Lanes MAC Lane Speed PHY Lanes PHY Lane Speed Line Lanes Line Lane Speed Oper Admin
-------- ----------- --------------- ---------------- --------------- ---------------- ------------ ----------------- ------ -------
1 Ethernet48 121,122,123,124 25G 200,201,202,203 25G 204,205 50G down down
1 Ethernet49 125,126,127,128 25G 206,207,208,209 25G 210,211 50G down down
1 Ethernet50 69,70,71,72 25G 212,213,214,215 25G 216 100G down down
In addition, docker ps | grep phy should show a physyncd docker running.
Signed-off-by: syd.logan@broadcom.com
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
SWSS config script restore ARP/FDB/Routes. Restore neighbor script
uses config DB ARP information to restore ARP entries and so needs
to be started after swssconfig exits.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Install a newer version of rsyslog from stretch-backports to support -iNONE
Previous backport from master use -iNONE option which is only
available after v8.32.0
Signed-off-by: Guohan Lu <lguohan@gmail.com>
We are moving toward building all Python packages for SONiC as wheel packages rather than Debian packages. This will also allow us to more easily transition to Python 3.
Python files are now packaged in "sonic-utilities" Pyhton wheel. Data files are now packaged in "sonic-utilities-data" Debian package.
**- How I did it**
- Build and install sonic-utilities as a Python package
- Remove explicit installation of wheel dependencies, as these will now get installed implicitly by pip when installing sonic-utilities as a wheel
- Build and install new sonic-utilities-data package to install data files required by sonic-utilities applications
- Update all references to sonic-utilities scripts/entrypoints to either reference the new /usr/local/bin/ location or remove absolute path entirely where applicable
Submodule updates:
* src/sonic-utilities aa27dd9...2244d7b (5):
> Support building sonic-utilities as a Python wheel package instead of a Debian package (#1122)
> [consutil] Display remote device name in show command (#1120)
> [vrf] fix check state_db error when vrf moving (#1119)
> [consutil] Fix issue where the ConfigDBConnector's reference is missing (#1117)
> Update to make config load/reload backward compatible. (#1115)
* src/sonic-ztp dd025bc...911d622 (1):
> Update paths to reflect new sonic-utilities install location, /usr/local/bin/ (#19)
* Add bgpmon under sonic-bgpcfgd to be started as a new daemon under BGP docker
* Added bgpmon to be monitored by Monit so that if it crashed, it gets alerted
* use console_scripts entry point to package bgpmon
This PR limited the number of calls to sonic-cfggen to one call
per iteration instead of current 3 calls per iteration.
The PR also installs jq on host for future scripts if needed.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Arp update process was not being started due to an issue with
the directory name having an extra 'd' in supervisor as in
'/etc/supervisord/conf.d/arp_update.conf'.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* Support for multi-asic platform for swssloglevel command
admin@str-acs-1:~$ swssloglevel
Usage: /usr/bin/swssloglevel -n [0 to 3] [OPTION]...
* Update to use the env file to get the PLATFORM string.
Printing both snapshot and current counter sets will make it easier to pinpoint
which message type(s) is/are not being relayed. This PR prints both counter sets.
Also, this PR defines gnu11 as a C standard to compile with in order to avoid
making changes when porting to 201811 branch.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
When BGP routes are missing, DHCP packets get relayed over mgmt
interface. This results in dhcpmon alerting that DHCP packets are
not being relayed. This is PR include mgmt interface as uplink
device, and so, if DHCP packet gets relayed over mgmt interface,
regular dhcpmon alert will not be issues. Instead, dhcpmon will
check the mgmt interface counts and issue a separate alert regarding
packets travelling through mgmt network.
In addition, this PR includes the following enhancements:
1. Add SIGUSR1 handler that prints out current packet counts
2. Increase alert grace window to 3 minutes from currently 2 minutes
3. Time is now computed more accurately
4. Print vlan name before counters
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
* [redis] Use redis-server and redis-tools in blob storage to prevent
upstream link broken
* Use curl instead of wget
* Explicitly install dependencies
https://github.com/Azure/sonic-buildimage/issues/5255
Root Cause: Waiting on Restore count != 0 can lead to race condition
between orchagent process and swssconfig.sh.
Ideally check of Restore count != 0 is not needed as the State DB
cannot be flushed as if it was flushed then Warm Restart or swss-restart
should not be true also.
Remove radvd Makefile and patch, change docker-router-advertiser Dockerfile template to simply install the vanilla radvd package using apt-get.
- In PR https://github.com/Azure/sonic-buildimage/pull/2795, we started building radvd from source and patching it to prevent it from erroring out when advertising an MTU of 9100 which was greater than the MTU size configured on the bridge interface (1500), which was due to a limitation in the 4.9 Linux kernel.
- Master branch is now using Linux kernel 4.19. As of 4.18, the kernel supports setting a bridge MTU to a value > 1500.
- PR https://github.com/Azure/sonic-swss/pull/1393 modified vlanmgrd to take advantage of this and now configures the MTU of bridge interfaces in SONiC to the proper size of 9100. Therefore, we no longer need to patch radvd. Since we no longer need to patch radvd, we no longer need to build it from source, so we can save build time by going back to simply installing the vanilla radvd Debian package in the router-advertiser container.
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
The following changes are done.
- Multi asic platform have 2 Loopback interfaces, Loopback0 and Loopback4096. IPinIP decap entries need to be added for both of them. Update the ipinip.json.j2 template to add decap entries for Loopback4096.
- Add corressponding unit test
Add a master switch so that the sync/async mode can be configured.
Example usage of the switch:
1. Configure mode while building an image
`make ENABLE_SYNCHRONOUS_MODE=y <target>`
2. Configure when the device is running
Change CONFIG_DB with `sonic-cfggen -a '{"DEVICE_METADATA":{"localhost": {"synchronous_mode": "enable"}}}' --write-to-db`
Restart swss with `systemctl restart swss`
**- Why I did it**
PR https://github.com/Azure/sonic-buildimage/pull/4599 introduced two bugs in the startup of the router advertiser container:
1. References to the `wait_for_intf.sh` script were changed to `wait_for_link.sh`, but the actual script was not renamed
2. The `ipv6_found` Jinja2 variable added to the supervisor config file goes out of scope before it is read.
**- How I did it**
1. Rename the `wait_for_intf.sh` script to `wait_for_link.sh`
2. Use the Jinja2 "namespace" construct to fix the scope issue
**- How to verify it**
Ensure all processes in the radv container start properly under the correct conditions (i.e., whether or not there is at least one VLAN with an IPv6 address assigned).
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting radv service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting swss service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to two calls during startup when starting frr service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to one call during startup when starting dhcp-relay
service.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to once calle during snmp startup
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future. ledd and psud were converted in earlier PRs. This PR converts the remainder:
- pcied
- syseepromd
- thermalctld
- xcvrd
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future.
As part of migrating all Python-based package installers to wheel format rather than Debian packages. Also to allow for easily building a Python 3 version of the package in the near future.
- Also remove some references to sonic-daemon-base which I previously missed and add missing sonic-py-common dependency for sonic-pcied.
* [platform] Add Support For Environment Variable
This PR adds the ability to read environment file from /etc/sonic.
the file contains immutable SONiC config attributes such as platform,
hwsku, version, device_type. The aim is to minimize calls being made
into sonic-cfggen during boot time.
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* Changes to add template support for copp.json.
This is needed so that we can install differnt type of
Traps based on Device Role (Tor/Leaf/Mgmt/etc...).
Initial use case is to install DHCP/DHCPv6 tarp only
for tor router.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comments.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fixed based on review comment.
Verify that /etc/apt/sources.list points to buster using docker exec bgp cat /etc/apt/sources.list
BGP neighborship is established.
root@sonic:~# show ip bgp summary
IPv4 Unicast Summary:
BGP router identifier 10.1.0.1, local AS number 65100 vrf-id 0
BGP table version 1
RIB entries 1, using 184 bytes of memory
Peers 1, using 20 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
6.1.1.1 4 100 96 96 0 0 0 01:32:04 0
Total number of neighbors 1
root@sonic:~#
Signed-off-by: Joyas Joseph <joyas_joseph@dell.com>
Copy proper fancontrol config file to the proper destination. Also some minor refactoring for code reuse to help prevent issues like this in the future.
Fixes a bug introduced by #4599
fixes https://github.com/Azure/sonic-buildimage/issues/5026
Explanation:
In the log from the issue I found:
```
I see following in the log
Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
```
Analyzing source code I found that the error message could be issues only when `zclient_send_rnh()` return less than 0.
```
ret = zclient_send_rnh(zclient, command, p, exact_match,
bnc->bgp->vrf_id);
/* TBD: handle the failure */
if (ret < 0)
flog_warn(EC_BGP_ZEBRA_SEND,
"sendmsg_nexthop: zclient_send_message() failed");
```
I checked [zclient_send_rnh()](88351c8f6d/lib/zclient.c (L654)) and found that this function will return the exit code which the function gets from [zclient_send_message()](88351c8f6d/lib/zclient.c (L266)) But the latter function could return not 0 in two cases:
1. bgpd didn’t connect to the zclient socket yet [code](88351c8f6d/lib/zclient.c (L269))
2. The socket was closed. But in this case we would receive the error message in the log. (And I can find the message in the log when we reboot sonic) [code](88351c8f6d/lib/zclient.c (L277))
Also I see from the logs that client connection was set later we had the issue in bgpd.
Bgpd.log
```
Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
```
Vs
Zebra.log
```
Jul 22 21:13:12.713249 vlab-01 NOTICE bgp#zebra[48]: client 25 says hello and bids fair to announce only static routes vrf=0
Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 30 says hello and bids fair to announce only bgp routes vrf=0
Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 33 says hello and bids fair to announce only vnc routes vrf=0
```
So in our case we should start zebra first. Wait until it is started and then start bgpd and other daemons.
**- How I did it**
I changed a graph to start daemons in the following order:
1. First start zebra
2. Then starts staticd and bgpd
3. Then starts vtysh -b and bgpeoi after bgpd is started.
Optimizing number of calls made to sonic-cfggen during service
start up as it adds to total system boot up time.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
sonic-cfggen call is slow and it adds to system start up time
**- How I did it**
places all required variable into single template and called into sonic-cfggen using this template
**- How to verify it**
***-Test 1***
there is an average saving of .5 to 1 sec between old script and new script
```
root@str-s6000-acs-14:/# time ./orchagent_old.sh
/usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
real 0m3.546s
user 0m2.365s
sys 0m0.585s
root@str-s6000-acs-14:/# time ./orchagent_new.sh
/usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
real 0m2.058s
user 0m1.650s
sys 0m0.363s
```
***-Test 2***
Built an image with this change and orchagent is running with intended params:
```
admin@str-s6000-acs-14:~$ ps -ef | grep orchagent
root 2988 1901 1 02:09 pts/0 00:00:02 /usr/bin/orchagent -d /var/log/swss -b 8192 -m f4:8e:38:16:bc:8d
```
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Optimizing number of calls made to sonic-cfggen during service
start up as it adds to total system boot up time.
***-Test 1***
there is an average saving of 1 to 1.5 sec between old script and new script
```
root@str-s6000-acs-14:/# time /usr/bin/rest-server-old.sh
Generating temporary TLS server certificate ...
2020/07/09 19:03:33 wrote cert.pem
2020/07/09 19:03:33 wrote key.pem
REST_SERVER_ARGS = -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
/usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
real 0m8.790s
user 0m7.993s
sys 0m0.584s
root@str-s6000-acs-14:/# time /usr/bin/rest-server-new.sh
Generating temporary TLS server certificate ...
2020/07/09 19:03:45 wrote cert.pem
2020/07/09 19:03:45 wrote key.pem
REST_SERVER_ARGS = -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
/usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
real 0m6.940s
user 0m5.670s
sys 0m0.386s
```
***-Test 2***
Built an image with this change and rest server is running with params as described in test 1 above
```
admin@str-s6000-acs-14:~$ ps -ef | grep rest_server
root 3301 2866 2 02:09 pts/0 00:00:10 /usr/sbin/rest_server -ui /rest_ui -logtostderr -cert /tmp/cert.pem -key /tmp/key.pem
```
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
For telemetry regression test we need gnmi client to be present on ptfdocker. Gnmi-server will be present on SONiC DuT. Further, we can access gnmi_get from ptfdocker inside pytest to verify gnmi server streaming data successfully or not.
The template is referenced relative to the script path and this could
results in errors in case script is run from root. Add explicit
path to the template file name.
Also, moving telemetry_var template to template dir.
And remove double quotes from around json dict.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
* [mgmt docker] move pycryptodome installation to the end of the docker building
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* pin down the version to current: 3.9.8
* comment
Signed-off-by: Akhilesh Samineni <akhilesh.samineni@broadcom.com>
All new NAT conntrack entries are added to kernel with max entry timeout of 432000 and setting the same timeout during system warm reboot also
sonic-cfggen call is slow and this is taking place in the SONiC
boot up process. The change uses templates to assemble all required
vars into single template file. With this change, telemetry now calls
once into sonic-cfggen.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Resubmitting the changes for (#4825) with fixes for sonic-bgpcdgd test failures
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
also update submodule
* 01f810f 2020-07-02 | fix compiling issue for gcc8.3 (#1339) [lguohan]
* 9b13120 2020-07-03 | Fix in script to avoid orchagent crash when port down followed by fdb delete (#1340) [rupesh-k]
* 9b01844 2020-07-01 | [qosorch] Update QoS scheduler params for shaping features (#1296) [Michael Li]
* 86b5e99 2020-07-02 | [mirrororch] Port Mirroring implementation (#1314) [rupesh-k]
* c05601c 2020-06-24 | [portsyncd]: add debug message if a port cannot be found in port able (#1328) [lguohan]
* a0b6412 2020-06-23 | COPP_DEL_fix: DEL for one trap group from SONIC is resetting all the trap IDs (#1273) [SinghMinu]
Signed-off-by: Guohan Lu <lguohan@gmail.com>
* Loopback IP changes for multi ASIC devices
multi ASIC will have 2 Loopback Interfaces
- Loopback0 has globally unique IP address, which is advertised by the multi ASIC device to its peers.
This way all the external devices will see this device as a single device.
- Loopback4096 is assigned an IP address which has a scope is within the device. Each ASIC has a different ip address for Loopback4096. This ip address will be used as Router-Id by the bgp instance on multi ASIC devices.
This PR implements this change for multi ASIC devices
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
* [sonic-buildimage] Changes to make network specific sysctl
common for both host and docker namespace (in multi-npu).
This change is triggered with issue found in multi-npu platforms
where in docker namespace
net.ipv6.conf.all.forwarding was 0 (should be 1) because of
which RS/RA message were triggered and link-local router were learnt.
Beside this there were some other sysctl.net.ipv6* params whose value
in docker namespace is not same as host namespace.
So to make we are always in sync in host and docker namespace
created common file that list all sysctl.net.* params and used
both by host and docker namespace. Any change will get applied
to both namespace.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Address Review Comments and made sure to invoke augtool
only one and do string concatenation of all set commands
* Address Review Comments.
* Support for connecting to DB in namespace via IP:port ( using docker bridge network ) for applications in multi-asic platform.
* Added the default IP as 127.0.0.1 if the IPaddress derivation from interface fails.
Moved the localhost loopback IP binding logic into the supervisor.j2 file.
* Changes to make default route programming
correct in multi-asic platform where frr is not running
in host namespace. Change is to set correct administrative distance.
Also make NAMESPACE* enviroment variable available for all dockers
so that it can be used when needed.
Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
* Fix review comments
* Review comment to check to add default route
only if default route exist and delete is successful.
The program name in critical_processes file must match the program name defined in supervisord.conf file.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
The current stdout file which also includes the dut logs are very verbose and noisy.
We have manually installed it in the sonic-mgmt docker in our organization and tuned the pytest settings to produce very helpful and concise logs.
pytest-html plugins can be used to post-process the output in various ways based on our different and unique organizational needs.
Hence proposing to add this pkt to the docker file
when portsyncd starts, it first enumerates all front panel ports
and marks them as old interfaces. Then, for new front panel ports
it checks if their indexes exist in previous sets. If yes, it will
treats them as old interfaces and ignore them.
The reason we have this check is because broadcom SAI only removes
front panel ports after sai switch init.
So, if portsyncd starts after orchagent, new interfaces could be
created before portsyncd and treated as old interface.
Signed-off-by: Guohan Lu <lguohan@gmail.com>
FDB/ARP/Default routes files are deleted after swssconfig. This
makes debugging/validation of device conversion hard. This PR
saves those files in order to facilitate debugging of device conversion.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
REST and telemetry servers were using "DEVICE_METADATA|x509" table for
server certificate configurations. This table has been deprecated now.
Enhanced REST server startup script to read server certificate file
path configurations from REST_SERVER table. Three more attributes -
server_crt, server_key and ca_crt are introduced as described in
https://github.com/Azure/SONiC/pull/550.
For backard compatibility, certificate configurations are read from
old "DEVICE_METADATA|x509" table if they (server_crt, server_key and
ca_crt) are not present in REST_SERVER table.
Fixes bug https://github.com/Azure/sonic-buildimage/issues/4291
Signed-off-by: Sachin Holla <sachin.holla@broadcom.com>
The `-sv2` suffix was used to differentiate SNMP Dockers when we transitioned from "SONiCv1" to "SONiCv2", about four years ago. The old Docker materials were removed long ago; there is no need to keep this suffix. Removing it aligns the name with all the other Dockers.
Also edit Monit configuration to detect proper snmp-subagent command line in Buster, and make snmpd command line matching more robust.
The -sv2 suffix was used to differentiate SNMP Dockers when we transitioned from "SONiCv1" to "SONiCv2", about four years ago. The old Docker materials were removed long ago; there is no need to keep this suffix. Removing it aligns the name with all the other Dockers.
**- Why I did it**
We need RIF counters to be enabled by default. Flex Counter does probe for supported counters. If a platform does not support RIF counters, SAI will return NOT_SUPPORTED and Flex Counter will stop polling the counter.
**- How to verify it**
After fresh install rif counter gropup is enabled by default:
$ counterpoll show
Type Interval (in ms) Status
-------------------- ------------------ --------
QUEUE_STAT default (10000) enable
PORT_STAT default (1000) enable
RIF_STAT default (1000) enable
QUEUE_WATERMARK_STAT default (10000) enable
PG_WATERMARK_STAT default (10000) enable
Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
* Adding new BGP peer groups PEER_V4_INT and PEER_V6_INT. The internal BGP sessions
will be added to this peer group while the external BGP sessions will be added
to the exising PEER_V4 and PEER_V6 peer group.
* Check for "ASIC" keyword in the hostname to identify the internal neighbors.
lldpmgrd listens for changes to the PORT table in the CONFIG_DB and APP_DB in order to handle alias/description config change. It checks if port is up or down by looking into the oper-status for in APP_DB PORT TABLE. If it cannot find it in the App DB, it will log error.
During initializing, it is possible that there is a port change in CONFIG_DB and but the not ready in APP_DB.
The change here is to only log error in is_port_up() after port init done.