Commit Graph

810 Commits

Author SHA1 Message Date
Guohan Lu
bed4c26b09 Revert "Add ethtool to docker-platform-monitor (#8017)"
This reverts commit d66425dd76.
2021-07-07 23:37:28 -07:00
VenkatCisco
d66425dd76 Add ethtool to docker-platform-monitor (#8017)
#### Why I did it
ethtool can be used to query and change settings such as speed, auto- negotiation and checksum offload on many network devices, especially Ethernet devices. 

#### How I did it
add package extension to docker-platform-monitor/Dockerfile.j2
2021-07-07 09:40:11 +00:00
VenkatCisco
36d7dfbea3 Add libpci3 pkg to docker-platform-monitor (#8016)
#### Why I did it
The libpci library provides portable access to configuration registers of devices connected to the PCI bus.

#### How I did it
update dockers/docker-platform-monitor/Dockerfile.j2
2021-07-07 09:40:06 +00:00
thomas.cappleman@metaswitch.com
1d3e7ab161 [build]: Fix sonic-cfggen contextlib err (#7996)
A recent version of contextlib2 (https://pypi.org/project/contextlib2/21.6.0/#history) has broken Python2 compatibility,
so the version picked up by netaddr when using Python2 must be specified, or else builds fail

Co-authored-by: Tom Zhu <tom.zhu@metaswitch.com>
2021-06-28 17:18:45 -07:00
Andriy Yurkiv
2fe91ae30f Set default values only on the first start (#7735) 2021-06-16 12:38:30 +00:00
bingwang-ms
c1b380df73
[docker-teamd]: Increase teammgrd timeout to allow graceful shutdown. (#7662) (#7842)
The PR is a cherry-pick of #7662.

Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
2021-06-10 12:49:18 -07:00
yozhao101
fb2c995f53
[202012][Monit] Deprecate the feature of monitoring the critical processes by Monit (#7823)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.

How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.

How to verify it
I verified this on the device str-7260cx3-acs-1.
2021-06-09 09:04:22 -07:00
Myron Sosyak
e7009513da [docker-database] Fix Python3 issue (#7700)
#### Why I did it
To avoid the following error
```
Traceback (most recent call last):
  File "/usr/local/bin/flush_unused_database", line 10, in <module>
    if 'PONG' in output:
TypeError: a bytes-like object is required, not 'str'
```
`communicate` method returns the strings if streams were opened in text mode; otherwise, bytes.
In our case text arg  in Popen is not true and that means that `communicate` return the bytes
#### How I did it
Set `text=True` to get strings instead of bytes
#### How to verify it
run `/usr/local/bin/flush_unused_database` inside database container
2021-06-02 02:39:31 +00:00
bingwang-ms
eb8c05c306 Fix lldpmgrd syntax issue (#7742)
Signed-off-by: bingwang <bingwang@microsoft.com>
2021-06-02 02:39:31 +00:00
Lawrence Lee
6a0e9078d4 [docker-orchagent]: Increase ndppd kernel poll interval (#7456)
Why I did it
ndppd by default reads /proc/net/ipv6_route ever 30 seconds. Since T1s advertise so many routes to ToRs, this file is extremely large, and reading it causes ndppd's CPU usage to spike every 30 seconds

How I did it
Increase the delay for reading this file to the maximum possible value (max integer value), which will result in CPU spikes every ~24 days instead of every 30 seconds

How to verify it
Start ndppd with the new config file, confirm that no CPU spikes are seen except at startup

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
2021-06-02 02:38:54 +00:00
yozhao101
3af05fdffe [Monit] Restart telemetry container if memory usage is beyond the threshold (#7645)
Signed-off-by: Yong Zhao yozhao@microsoft.com

Why I did it
This PR aims to monitor the memory usage of streaming telemetry container and restart streaming telemetry container if memory usage is larger than the pre-defined threshold.

How I did it
I borrowed the system tool Monit to run a script memory_checker which will periodically check the memory usage of streaming telemetry container. If the memory usage of telemetry container is larger than the pre-defined threshold for 10 times during 20 cycles, then an alerting message will be written into syslog and at the same time Monit will run the script restart_service to restart the streaming telemetry container.

How to verify it
I verified this implementation on device str-7260cx3-acs-1.
2021-05-31 04:38:18 +00:00
bingwang-ms
c5d27750f2 Fix supervisor-proc-exit-listener startup issue in restapi (#7681)
* Fix supervisor-proc-exit-listener startup issue in restapi

Signed-off-by: bingwang <bingwang@microsoft.com>
2021-05-27 22:29:42 +00:00
LuiSzee
bc5b367d37 [radv] fix bug for radv can't startup if DEVICE_METADATA.localhost.type is NULL (#7651)
Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-05-26 02:39:02 +00:00
Myron Sosyak
4ec13fd556 Fix python version (#7658)
#### Why I did it
To avoid the following logs 
```
Mar 15 15:52:04.599302 igk-dut-04 INFO database#/supervisord: flushdb /bin/bash: /usr/local/bin/flush_unused_database: /usr/bin/python: bad interpreter: No such file or directory
Mar 15 15:52:04.599947 igk-dut-04 INFO database#supervisord 2021-03-15 15:52:04,599 INFO exited: flushdb (exit status 126; not expected)
```

#### How I did it
Fix  shebang
#### How to verify it
Check the logs
2021-05-24 22:25:47 +00:00
xumia
2973a63f1d Fix the type issue in rvtysh (#7648)
Why I did it
Change the type issue in the command rvtysh
change PARA/para to PARAM/param
2021-05-24 22:25:47 +00:00
sudhanshukumar22
0a5551aabf docker-lldp:intermittent DB errors will result in Client termination (#6119)
This PR allows listen to hostname changes and mgmt ip changes.
2021-05-24 22:21:29 +00:00
abdosi
dbded1f48e Changes in FRR temapltes for multi-asic (#6901)
1. Made the command next-hop-self force only applicable on back-end asic bgp. This is done so that BGPL iBGP session running on backend can send e-BGP learn nexthop. Back end asic FRR is able to recursively resolve the eBGP nexthop in its routing table since it knows about all the connected routes advertise from front end asic.

2. Made all front-end asic bgp use global loopback ip (Loopback0) as router id and back end asic bgp use Loopbacl4096 as ruter-id and originator id for Route-Reflector. This is done so that routes learnt by external peer do not see Loopback4096 as router id in show ip bgp <route-prerfix> output.

3. To handle above change need to pass Loopback4096 from BGP manager for jinja2 template generation. This was missing and this change/fix is needed for this also https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-fpm-frr/frr/bgpd/templates/dynamic/instance.conf.j2#L27

4. Enhancement to add mult_asic specific bgpd template generation unit test cases.
2021-05-24 21:59:57 +00:00
abdosi
8f6b3456ab [multi-asic] BBR support on internal-peers for multi-asic platfroms. (#6848)
Enable BBR config allowas-in 1 for internal peers

Why I did:
To advertise BBR routes learnt via e-BGP peer in one asic/namespace to another iBGP asic/namespace via Route Reflector.
2021-05-24 21:57:10 +00:00
VenkatCisco
91b4ce649e [pmon]: add psmisc to bring fuser that dentifies processes that are using files or sockets (#7509)
fuser support is required since new cisco hardware watchdog plugin uses them to check anyone else use's /dev/watchdogX resource. The actual validation happens in the platform code, but the package is required for pmon container. Currently the /dev/watchdogX is being used by cisco platform-monitor service. Cisco chassis level watchdog plugin uses "fuser" to claim the watchdog release from platform-monitor service.
2021-05-10 16:00:43 -07:00
Junchao-Mellanox
6e12c40f40 [Mellanox] Support new sensor conf file for MSN4700 A1/A0 (#7535)
#### Why I did it

MSN4700 A1/A0 used different sensor chip but keep the existing platform name *x86_64-mlnx_msn4700-r0*, this is a workaround to replace the sensor conf on MSN4700 A1/A0

#### How I did it

Use a shell script to get the sensor conf path and copy that files to /etc/sensors.d/sensors.conf
2021-05-10 09:21:42 -07:00
trzhang-msft
d76206bae4 dhcpmon: support dual tor in docker template (#7470) 2021-05-05 09:34:42 -07:00
judyjoseph
8cea931cad Fixes for errors seen in staging devices (#7171)
With the latest 201911 image, the following error was seen on staging devices with TSB command ( for both single asic, multi asic ). Though this err message doesn't affect the TSB functionality, it is good to fix.

admin@STG01-0101-0102-01T1:~$ TSB
BGP0 : % Could not find route-map entry TO_TIER0_V4 20
line 1: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 permit 20
% Could not find route-map entry TO_TIER0_V4 30
line 2: Failure to communicate[13] to zebra, line: no route-map TO_TIER0_V4 deny 30

In addition, in this PR I am fixing the message displayed to user when there are no BGP neighbors configured on that BGP instance. In multi-asic device there could be case where there are no BGP neighbors configured on a particular ASIC.
2021-05-03 13:19:29 -07:00
judyjoseph
7ae4a990e7 [docker-fpm-frr]: TSA/B/C changes for multi-asic (#6510)
- Introduced TS common file in docker as well and moved common functions.
- TSA/B/C scripts run only in BGP instances for front end ASICs.
       In addition skip enforcing it on route maps used between internal BGP sessions.

admin@str--acs-1:~$ sudo /usr/bin/TSA
System Mode: Normal -> Maintenance

and in case of Multi-ASIC
admin@str--acs-1:~$ sudo /usr/bin/TSA
BGP0 : System Mode: Normal -> Maintenance
BGP1 : System Mode: Normal -> Maintenance
BGP2 : System Mode: Normal -> Maintenance
2021-05-03 13:19:17 -07:00
guxianghong
a0fde3a626 [arm] support compile sonic arm image on arm server (#7285)
- Support compile sonic arm image on arm server. If arm image compiling is executed on arm server instead of using qemu mode on x86 server, compile time can be saved significantly.
- Add kernel argument systemd.unified_cgroup_hierarchy=0 for upgrade systemd to version 247, according to #7228
- rename multiarch docker to sonic-slave-${distro}-march-${arch}

Co-authored-by: Xianghong Gu <xgu@centecnetworks.com>
Co-authored-by: Shi Lei <shil@centecnetworks.com>
2021-05-02 08:11:56 -07:00
xumia
1b05982727 Support readonly vtysh for sudoers (#7383)
Why I did it
Support readonly version of the command vtysh

How I did it
Check if the command starting with "show", and verify only contains single command in script.
2021-04-29 10:08:55 -07:00
kakkotetsu
e6bbb3c344 [restapi] fix python version during restapi startup (#7056)
changed from python3 to python in supervisord.conf.
2021-04-22 14:36:09 -07:00
Vivek Reddy Karri
731401fe4f Reverts the commit which reverts "Backport ethtool to support QSFP-DD (#5725)"
This reverts commit a86cdd87cf.
2021-04-15 19:15:58 -07:00
Stephen Sun
3cee45c298 [monit] Avoid monit error log by removing "-l" from monit_swss|buffermgrd (#7236)
Avoid the following error messages while dynamic buffer calculation is enabled
```
ERR monit[491]: 'swss|buffermgrd' status failed (1) -- '/usr/bin/buffermgrd -l' is not running in host
```

Change /usr/bin/buffermgrd -l to /usr/bin/buffermgrd. The buffermgrd is started by -l for traditional model or -a for dynamic model. So we need to use the common section of both.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
2021-04-08 18:39:10 +00:00
Prince Sunny
e08dc12acf [IPinIP] Add Loopback2 interface, change dscp mode to uniform (#7234)
Co-authored-by: Ubuntu <prsunny>
2021-04-08 18:38:59 +00:00
Guohan Lu
a86cdd87cf Revert "Backport ethtool to support QSFP-DD (#5725)"
This reverts commit 50e4cc1579.
2021-04-01 13:11:15 -07:00
Joe LeVeque
dd9be59cd1
[202012][dockers][supervisor] Increase event buffer size for process exit listener; Set all event buffer sizes to 1024 (#7203)
#### Why I did it

Backport of https://github.com/Azure/sonic-buildimage/pull/7083 to the 202012 branch.

To prevent error [messages](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802) like the following from being logged:

```
Mar 17 02:33:48.523153 vlab-01 INFO swss#supervisord 2021-03-17 02:33:48,518 ERRO pool supervisor-proc-exit-listener event buffer overflowed, discarding event 46
```

This is basically an addendum to https://github.com/Azure/sonic-buildimage/pull/5247, which increased the event buffer size for dependent-startup. While supervisor-proc-exit-listener doesn't subscribe to as many events as dependent-startup, there is still a chance some containers (like swss, as in the example above) have enough processes running to cause an overflow of the default buffer size of 10.

This is especially important for preventing erroneous log_analyzer failures in the sonic-mgmt repo regression tests, which have started occasionally causing PR check builds to fail. Example [here](https://dev.azure.com/mssonic/build/_build/results?buildId=2254&view=logs&j=9a13fbcd-e92d-583c-2f89-d81f90cac1fd&t=739db6ba-1b35-5485-5697-de102068d650&l=802).

I set all supervisor-proc-exit-listener event buffer sizes to 1024, and also updated all dependent-startup event buffer sizes to 1024, as well, to keep things simple, unified, and allow headroom so that we will not need to adjust these values frequently, if at all.
2021-04-01 12:52:19 -07:00
Shi Su
576864b769 [bgp]: Reduce bgp connect retry timer to 10 seconds (#7169)
The default bgp connect retry timer is 120 seconds. A reconnection will happen 120 seconds if the initial connection fails. This PR aims to allow a more frequent retry.
2021-03-31 08:49:23 -07:00
judyjoseph
902ad1357a To decrease the Connect Retry Timer from default value which is 120sec to 10 sec. (#7087)
Why I did it
It was observed that on a multi-asic DUT bootup, the BGP internal sessions between ASIC's was taking more time to get ESTABLISHED than external BGP sessions. The internal sessions was coming up almost exactly 120 secs later.

In multi-asic platform the bgp dockers ( which is per ASIC ) on switch start are bring brought up around the same time and they try to make the bgp sessions with neighbors (in peer ASIC's) which may be not be completely up. This results in BGP connect fail and the retry happens after 120sec which is the default Connect Retry Timer

How I did it
Add the command to set the bgp neighboring session retry timer to 10sec for internal bgp neighbors.
2021-03-26 17:39:50 +00:00
shlomibitton
50e4cc1579 Backport ethtool to support QSFP-DD (#5725)
Backport ethtool debian package version 5.9 to support QSFP-DD cable parsing.

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
2021-03-19 10:29:58 -07:00
trzhang-msft
1dec175743 [docker-dhcp-relay]: add -si support in dhcp docker template (#7053) 2021-03-15 19:20:28 -07:00
Qi Luo
97426aff5a [build]: Fix get-pip 2.7 url according to upstream announcement (#6999)
ref: https://bootstrap.pypa.io/2.7/get-pip.py

The URL you are using to fetch this script has changed, and this one will no
longer work. Please use get-pip.py from the following URL instead:

    https://bootstrap.pypa.io/pip/2.7/get-pip.py
2021-03-10 09:32:12 -08:00
Tamer Ahmed
7ec9fbb678 Start DHCP Relay When Helpers IPs Are Available (#6961)
#### Why I did it

It is possible to have DHCP relay configuration with no servers/
helpers which result in DHCP container to crash. This PR fixes this
issue by not starting DHCP relay for vlans with no DHCP helpers.

resolves: #6931 
closes: #6931 
#### How I did it
Do not add program group for dhcp relay with not dhcp helpers

#### How to verify it
Unit test
2021-03-10 09:23:56 -08:00
Qi Luo
414a942462 [radv] Disable radv for specific deployment_id (#6830) 2021-02-23 23:56:01 +00:00
Andriy Yurkiv
569686ed84 Enable SAI_INGRESS_PRIORITY_GROUP_STAT_DROPPED_PACKETS counter by default (#6444)
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
2021-02-23 23:56:01 +00:00
arlakshm
d7be5a021a [Multi Asic] support of swss.rec and sairedis.rec for multi asic (#6310)
Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan arlakshm@microsoft.com

- Why I did it
This PR has the changes to support having different swss.rec and sairedis.rec for each asic.
The logrotate script is updated as well

- How I did it

Update the orchagent.sh script to use the logfile name options in these PRs(Azure/sonic-swss#1546 and Azure/sonic-sairedis#747)
In multi asic platforms the record files will be different for each asic, with the format swss.asic{x}.rec and sairedis.asic{x}.rec

Update the logrotate script for multiasic platform .
2021-02-23 23:56:01 +00:00
Joe LeVeque
d7517a704c [PDDF] Build and install Python 3 package (#6286)
- Make PDDF code compliant with both Python 2 and Python 3
- Align code with PEP8 standards using autopep8
- Build and install both Python 2 and Python 3 PDDF packages
2021-02-23 23:56:01 +00:00
pra-moh
c240929607 [StreamingTelemetry] add noTLS support for debug purpose (#6704)
adding noTLS mode for debugging purpose
Removing config-set for port 8080. It fails to start telemetry if docker restarts in case on noTLS mode because it expects log_level config to be present as well.
2021-02-23 13:55:01 -08:00
yozhao101
cdef77f4c5 [SwSS] Disabled the autorestart of process coppmgrd. (#6774)
coppmgrd process do not need to be auto-restarted if it exited unexpectedly.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
2021-02-16 15:32:36 -08:00
Guohan Lu
89eb0c4a3c [docker-fmp-frr]: remove blank lines in generated critical_process
Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-28 09:29:09 -08:00
yozhao101
cc9c3f567e [supervisord] Monitoring the critical processes with supervisord. (#6242)
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.

Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.

- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:

The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.

If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.

- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.

Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.

- Which release branch to backport (provide reason below if selected)

 201811
 201911
[x ] 202006
2021-01-28 09:28:27 -08:00
Shi Su
fc825f9a58 [FRR] Create a separate script to wait zebra to be ready to receive connections (#6519)
The requirement for zebra to be ready to accept connections is a generic problem that is not 
specific to bgpd. Making the script to wait for zebra socket a separate script and let bgpd and 
staticd to wait for zebra socket.
2021-01-28 09:25:29 -08:00
Guohan Lu
0b2abafcde [docker-ptf]: build docker ptf
- combine docker-ptf-saithrift into docker-ptf docker
- build docker-ptf under platform vs
- remove docker-ptf for other platforms

Signed-off-by: Guohan Lu <lguohan@gmail.com>
2021-01-28 09:23:12 -08:00
Tamer Ahmed
80ba3a4f6f [dhcp-relay]: Launch DHCP Relay On L3 Vlan (#6527)
Recent changes brought l2 vlan concept which do not have DHCP
clients behind them and so DHCP relay is not required. Also,
dhcpmon fails to launch on those vlans as their interfaces
lack IP addresses. This PR limit launch of both DHCP relay
and dhcpmon to L3 vlans only.

singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
2021-01-28 09:21:49 -08:00
Zhenhong Zhao
154a6ab6c5 [frrcfgd] introduce frrcfgd to manage frr config when frr_mgmt_framework_config is true (#5142)
- Support for non-template based FRR configurations (BGP, route-map, OSPF, static route..etc) using config DB schema.
- Support for save & restore - Jinja template based config-DB data read and apply to FRR during startup

**- How I did it**

- add frrcfgd service
- when frr_mgmg_framework_config is set, frrcfgd starts in bgp container
- when user changed the BGP or other related table entries in config DB, frrcfgd will run corresponding VTYSH commands to program on FRR.
- add jinja template to generate FRR config file to be used by FRR daemons while bgp container restarted

**- How to verify it**
1. Add/delete data on config DB and then run VTYSH "show running-config" command to check if FRR configuration changed.
1. Restart bgp container and check if generated FRR config file is correct and run VTYSH "show running-config" command to check if FRR configuration is consistent with attributes in config DB

Co-authored-by: Zhenhong Zhao <zhenhong.zhao@dell.com>
2021-01-24 22:43:30 -08:00
Samuel Angebault
9dfee1f09a [pmon]: Run ledd using python3 unless excluded (#6528)
**- Why I did it**

Ledd is the last daemon that is not enabled to run in python3.
Even though there is a plan to deprecate this daemon and to replace it by something else it's one simple step toward python2 deprecation.

**- How I did it**

Changed the `command=` line for `ledd` in the `supervisord` configuration of `pmon`.
Copied what was done for other daemons.

**- How to verify it**

Booting a product that has a `led_control.py` should now show the ledd running in python3.
I ran `python3 -m pylint` on all `led_control.py` plugin which means that most of them should be python3 compliant.
There is however still a risk that some might not work.
2021-01-22 11:42:22 -08:00