Modify snmpd.conf to start snmpd to listen on specific management and loopback ips instead of listening on any ip.
#### Why I did it
SNMP over IPv6 is not working for all scenarios for a single asic platforms.
The expectation is that SNMP query over IPv6 should work over Management or Loopback0 addresses.
**Specific scenario where this issue is seen**
In case of Lab T0 device, when SNMP request is sent from a directly connected T1 neighbor over Loopback IP, SNMP response was not received.
This was because the SRC IP address in SNMP response was not Loopback IP, it was the PortChannel IP connected to the neighboring device.
```
23:18:51.620897 In 22:26:27:e6:e0:07 ethertype IPv6 (0x86dd), length 105: fc00::72.41725 > **fc00:1::32**.161: C="msft" **GetRequest**(28) .1.3.6.1.2.1.1.1.0
23:18:51.621441 Out 28:99:3a:a0:97:30 ethertype IPv6 (0x86dd), length 241: **fc00::71**.161 > fc00::72.41725: C="msft" **GetResponse**(162) .1.3.6.1.2.1.1.1.0="SONiC Software Version: SONiC.xxx - HwSku: xx - Distribution: Debian 10.13 - Kernel: 4.19.0-12-2-amd64"
```
In case of IPv4, the SRC IP in SNMP response was correctly set to Loopback IP.
```
23:25:32.769712 In 22:26:27:e6:e0:07 ethertype IPv4 (0x0800), length 85: 10.0.0.57.56701 > **10.1.0.32**.161: C="msft" **GetRequest**(28) .1.3.6.1.2.1.1.1.0
23:25:32.975967 Out 28:99:3a:a0:97:30 ethertype IPv4 (0x0800), length 221: **10.1.0.32**.161 > 10.0.0.57.56701: C="msft" **GetResponse**(162) .1.3.6.1.2.1.1.1.0="SONiC Software Version: SONiC.xxx - HwSku: xx - Distribution: Debian 10.13 - Kernel: 4.19.0-12-2-amd64"
```
**Sequence of SNMP request and response**
1. SNMP request will be sent with SRC IP fc00::72 DST IP fc00:1::32
2. SNMP request is received at SONiC device is sent to snmpd which is listening on port 161 :::161/
3. snmpd process will parse the request create a response and sent to DST IP fc00::72.
snmpd process does not track the DST IP on which the SNMP request was received, which in this case is Loopback IP.
snmpd process will only keep track what is tht IP to which the response should be sent to.
4. snmpd process will send the response packet.
5. Kernel will do a route look up on destination IP and find the best path.
ip -6 route get fc00::72
fc00::72 from :: dev PortChannel101 proto kernel src fc00::71 metric 256 pref medium
5. Using the "src" ip from about, the response is sent out. This SRC ip is that of the PortChannel and not the device Loopback IP.
The same issue is seen when SNMP query is sent from a remote server over Management IP.
SONiC device eth0 --------- Remote server
SNMP request comes with SRC IP <Remote_server> DST IP <Mgmt IP>
If kernel finds best route to Remote_server_IP is via BGP neighbors, then it will send the response via front-panel interface with SRC IP as Loopback IP instead of Management IP.
Main issue is that in case of IPv6, snmpd ignores the IP address to which SNMP request was sent, in case of IPv6.
In case of IPv4, snmpd keeps track of DST IP of SNMP request, it will keep track if the SNMP request was sent to mgmt IP or Loopback IP.
Later, this IP is used in ipi_spec_dst as SRC IP which helps kernel to find the route based on DST IP using the right SRC IP.
https://github.com/net-snmp/net-snmp/blob/master/snmplib/transports/snmpUDPBaseDomain.c#L300
ipi.ipi_spec_dst.s_addr = srcip->s_addr
Reference: https://man7.org/linux/man-pages/man7/ip.7.html
```
If IP_PKTINFO is passed to sendmsg(2)
and ipi_spec_dst is not zero, then it is used as the local
source address for the routing table lookup and for
setting up IP source route options. When ipi_ifindex is
not zero, the primary local address of the interface
specified by the index overwrites ipi_spec_dst for the
routing table lookup.
```
**This issue is not seen on multi-asic platform, why?**
on multi-asic platform, there exists different network namespaces.
SNMP docker with snmpd process runs on host namespace.
Management interface belongs to host namespace.
Loopback0 is configured on asic namespaces.
Additional inforamtion on how the packet coming over Loopback IP reaches snmpd process running on host namespace: https://github.com/sonic-net/sonic-buildimage/pull/5420
Because of this separation of network namespaces, the route lookup of destination IP is confined to routing table of specific namespace where packet is received.
if packet is received over management interface, SNMP response also is sent out of management interface. Same goes with packet received over Loopback Ip.
##### Work item tracking
- Microsoft ADO **17537063**:
#### How I did it
Have snmpd listen on specific Management and Loopback IPs specifically instead of listening on any IP for single-asic platform.
Before Fix
```
admin@xx:~$ sudo netstat -tulnp | grep 161
udp 0 0 0.0.0.0:161 0.0.0.0:* 15631/snmpd
udp6 0 0 :::161 :::* 15631/snmpd
```
After fix
```
admin@device:~$ sudo netstat -tulnp | grep 161
udp 0 0 10.1.0.32:161 0.0.0.0:* 215899/snmpd
udp 0 0 10.3.1.1:161 0.0.0.0:* 215899/snmpd
udp6 0 0 fc00:1::32:161 :::* 215899/snmpd
udp6 0 0 fc00:2::32:161 :::* 215899/snmpd
```
**How this change helps with the issue?**
To see snmpd trace logs, modify snmpd to start using the below parameters, in supervisord.conf file
```
/usr/sbin/snmpd -f -LS0-7i -Lf /var/log/snmpd.log
```
When snmpd listens on any IP, snmpd binds to IPv4 and IPv6 sockets as below:
```
netsnmp_udpbase: binding socket: 7 to UDP: [0.0.0.0]:0->[0.0.0.0]:161
trace: netsnmp_udp6_transport_bind(): transports/snmpUDPIPv6Domain.c, 303:
netsnmp_udpbase: binding socket: 8 to UDP/IPv6: [::]:161
```
When IPv4 response is sent, it goes out of fd 7 and IPv6 response goes out of fd 8.
When IPv6 response is sent, it does not have the right SRC IP and it can lead to the issue described.
When snmpd listens on specific Loopback/Management IPs, snmpd binds to different sockets:
```
trace: netsnmp_udpipv4base_transport_bind(): transports/snmpUDPIPv4BaseDomain.c, 207:
netsnmp_udpbase: binding socket: 7 to UDP: [0.0.0.0]:0->[10.250.0.101]:161
trace: netsnmp_udpipv4base_transport_bind(): transports/snmpUDPIPv4BaseDomain.c, 207:
netsnmp_udpbase: binding socket: 8 to UDP: [0.0.0.0]:0->[10.1.0.32]:161
trace: netsnmp_register_agent_nsap(): snmp_agent.c, 1261:
netsnmp_register_agent_nsap: fd 8
netsnmp_udpbase: binding socket: 10 to UDP/IPv6: [fc00:1::32]:161
trace: netsnmp_register_agent_nsap(): snmp_agent.c, 1261:
netsnmp_register_agent_nsap: fd 10
netsnmp_ipv6: fmtaddr: t = (nil), data = 0x7fffed4c85d0, len = 28
trace: netsnmp_udp6_transport_bind(): transports/snmpUDPIPv6Domain.c, 303:
netsnmp_udpbase: binding socket: 9 to UDP/IPv6: [fc00:2::32]:161
```
When SNMP request comes in via Loopback IPv4, SNMP response is sent out of fd 8
```
trace: netsnmp_udpbase_send(): transports/snmpUDPBaseDomain.c, 511:
netsnmp_udp: send 170 bytes from 0x5581f2fbe30a to UDP: [10.0.0.33]:46089->[10.1.0.32]:161 on fd 8
```
When SNMP request comes in via Loopback IPv6, SNMP response is sent out of fd 10
```
netsnmp_ipv6: fmtaddr: t = (nil), data = 0x5581f2fc2ff0, len = 28
trace: netsnmp_udp6_send(): transports/snmpUDPIPv6Domain.c, 164:
netsnmp_udp6: send 170 bytes from 0x5581f2fbe30a to UDP/IPv6: [fc00::42]:43750 on fd 10
```
#### How to verify it
Verified on single asic and multi-asic devices.
Single asic SNMP query with Loopback
```
ARISTA01T1#bash snmpget -v2c -c xxx 10.1.0.32 1.3.6.1.2.1.1.1.0
SNMPv2-MIB::sysDescr.0 = STRING: SONiC Software Version: SONiC.xx - HwSku: Arista-7260xx - Distribution: Debian 10.13 - Kernel: 4.19.0-12-2-amd64
ARISTA01T1#bash snmpget -v2c -c xxx fc00:1::32 1.3.6.1.2.1.1.1.0
SNMPv2-MIB::sysDescr.0 = STRING: SONiC Software Version: SONiC.xx - HwSku: Arista-7260xxx - Distribution: Debian 10.13 - Kernel: 4.19.0-12-2-amd64
```
On multi-asic -- no change.
```
sudo netstat -tulnp | grep 161
udp 0 0 0.0.0.0:161 0.0.0.0:* 17978/snmpd
udp6 0 0 :::161 :::* 17978/snmpd
```
Query result using Loopback IP from a directly connected BGP neighbor
```
ARISTA01T2#bash snmpget -v2c -c xxx 10.1.0.32 1.3.6.1.2.1.1.1.0
SNMPv2-MIB::sysDescr.0 = STRING: SONiC Software Version: SONiC.xx - HwSku: xx - Distribution: Debian 9.13 - Kernel: 4.9.0-14-2-amd64
ARISTA01T2#bash snmpget -v2c -c xxx fc00:1::32 1.3.6.1.2.1.1.1.0
SNMPv2-MIB::sysDescr.0 = STRING: SONiC Software Version: SONiC.xx - HwSku: xx - Distribution: Debian 9.13 - Kernel: 4.9.0-14-2-amd64
```
<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->
Co-authored-by: SuvarnaMeenakshi <50386592+SuvarnaMeenakshi@users.noreply.github.com>
* Remove apt package lists and make macro to clean up apt and python cache
Remove the apt package lists (`/var/lib/apt/lists`) from the docker
containers. This saves about 100MB.
Also, make a macro to clean up the apt and python cache that can then be
used in all of the containers. This helps make the cleanup be consistent
across all containers.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Currently, the build dockers are created as a user dockers(docker-base-stretch-<user>, etc) that are
specific to each user. But the sonic dockers (docker-database, docker-swss, etc) are
created with a fixed docker name and common to all the users.
docker-database:latest
docker-swss:latest
When multiple builds are triggered on the same build server that creates parallel building issue because
all the build jobs are trying to create the same docker with latest tag.
This happens only when sonic dockers are built using native host dockerd for sonic docker image creation.
This patch creates all sonic dockers as user sonic dockers and then, while
saving and loading the user sonic dockers, it rename the user sonic
dockers into correct sonic dockers with tag as latest.
docker-database:latest <== SAVE/LOAD ==> docker-database-<user>:tag
The user sonic docker names are derived from 'DOCKER_USERNAME and DOCKER_USERTAG' make env
variable and using Jinja template, it replaces the FROM docker name with correct user sonic docker name for
loading and saving the docker image.
#### Why I did it
resolves https://github.com/Azure/sonic-buildimage/issues/8779
snmpd writes the below error message in syslog :
snmp#snmpd[27]: truncating integer value > 32 bits
This message is written in syslog when the hrSystemUptime(1.3.6.1.2.1.25.1.1.0 / system uptime) or sysUpTime(1.3.6.1.2.1.1.3 network management portion or snmpd uptime) is queried when either of these counters overflow beyond 32 bit value. This happens the device uptime or snmpd uptime is more than 497 days.
#### How I did it
Reference: https://access.redhat.com/solutions/367093 and https://linux.die.net/man/1/snmpcmd
To avoid seeing this message if the counter grows, the snmpd error log level is changed to display LOG_EMERG, LOG_ALERT, LOG_CRIT, and LOG_DEBUG.
Without this change, LOG_ERR and LOG_WARNING would also be logged in syslog.
#### How to verify it
On a device which is up for more than 497 days, modify supervisord.conf with the change and restart snmp.
Query 1.3.6.1.2.1.1.3 and verify that log message is not seen.
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
Currently we leveraged the Supervisor to monitor the running status of critical processes in each container and it is more reliable and flexible than doing the monitoring by Monit. So we removed the functionality of monitoring the critical processes by Monit.
How I did it
I removed the script process_checker and corresponding Monit configuration entries of critical processes.
How to verify it
I verified this on the device str-7260cx3-acs-1.
- Why I did it
Initially, we used Monit to monitor critical processes in each container. If one of critical processes was not running
or crashed due to some reasons, then Monit will write an alerting message into syslog periodically. If we add a new process
in a container, the corresponding Monti configuration file will also need to update. It is a little hard for maintenance.
Currently we employed event listener of Supervisod to do this monitoring. Since processes in each container are managed by
Supervisord, we can only focus on the logic of monitoring.
- How I did it
We borrowed the event listener of Supervisord to monitor critical processes in containers. The event listener will take
following steps if it was notified one of critical processes exited unexpectedly:
The event listener will first check whether the auto-restart mechanism was enabled for this container or not. If auto-restart mechanism was enabled, event listener will kill the Supervisord process, which should cause the container to exit and subsequently get restarted.
If auto-restart mechanism was not enabled for this contianer, the event listener will enter a loop which will first sleep 1 minute and then check whether the process is running. If yes, the event listener exits. If no, an alerting message will be written into syslog.
- How to verify it
First, we need checked whether the auto-restart mechanism of a container was enabled or not by running the command show feature status. If enabled, one critical process should be selected and killed manually, then we need check whether the container will be restarted or not.
Second, we can disable the auto-restart mechanism if it was enabled at step 1 by running the commnad sudo config feature autorestart <container_name> disabled. Then one critical process should be selected and killed. After that, we will see the alerting message which will appear in the syslog every 1 minute.
- Which release branch to backport (provide reason below if selected)
201811
201911
[x ] 202006
**- Why I did it**
I'm updating the jinja2 template to support getting SNMP information from the redis configdb.
I'm using the format approved here:
https://github.com/Azure/SONiC/pull/718
This will pave the way for us to decrement using the snmp.yml in the future.
Right now we will still be using both the snmp.yml and configdb to get variable information in order to create the snmpd.conf via the sonic-cfggen tool.
**- How I did it**
I first updated the SNMP Schema in PR #718 to get that approved as a standardized format.
Then I verified I could add snmp configs to the configdb using this standard schema. Once the configs were added to the configdb then I updated the snmpd.conf.j2 file to support the updates via the configdb while still using the variables in the snmp.yml file in parallel. This way we will have backward compatibility until we can fully migrate to the configdb only.
By updating the snmpd.conf.j2 template and running the sonic-cfggen tool the snmpd.conf gets generated with using the values in both the configdb and snmp.yml file.
Co-authored-by: trvanduy <trvanduy@microsoft.com>
This PR is in preparation to move from snmp.yml to configdb. This will more closely align with other commands in sonic and use configdb as the source of truth for snmp configuration.
Note: This is the first of 2 PR's to enable this. This PR will not change any functionality but will allow the snmp.yml file info to be put into the configdb.
Created a script that takes the snmp.yml variables and converts them to the configdb format.
Added file to dockerfile.j2 so that file is copied in the container.
Updated start.sh file to automatically run the python conversion script each time the docker container is restarted.
**- Why I did it**
As part of migrating SONiC codebase from Python 2 to Python 3
**- How I did it**
- No longer install Python 2 in docker-base-buster or docker-config-engine-buster.
- Install Python 2 and pip2 in the following containers until we can completely eliminate it there:
- docker-platform-monitor
- docker-sonic-mgmt-framework
- docker-sonic-vs
- Pin pip2 version <21 where it is still temporarily needed, as pip version 21 will drop support for Python 2
- Also preform some other cleanup, ensuring that pip3, setuptools and wheel packages are installed in docker-base-buster, and then removing any attempts to re-install them in derived containers
* First cut image update for kubernetes support.
With this,
1) dockers dhcp_relay, lldp, pmon, radv, snmp, telemetry are enabled
for kube management
init_cfg.json configure set_owner as kube for these
2) Each docker's start.sh updated to call container_startup.py to register going up
As part of this call, it registers the current owner as local/kube and its version
The images are built with its version ingrained into image during build
3) Update all docker's bash script to call 'container start/stop/wait' instead of 'docker start/stop/wait'.
For all locally managed containers, it calls docker commands, hence no change for locally managed.
4) Introduced a new ctrmgrd service, that helps with transition between owners as kube & local and carry over any labels update from STATE-DB to API server
5) hostcfgd updated to handle owner change
6) Reboot scripts are updatd to tag kube running images as local, so upon reboot they run the same image.
7) Added kube_commands.py to handle all updates with Kubernetes API serrver -- dedicated for k8s interaction only.
**- Why I did it**
We were building a custom version of Supervisor because I had added patches to prevent hangs and crashes if the system clock ever rolled backward. Those changes were merged into the upstream Supervisor repo as of version 3.4.0 (http://supervisord.org/changes.html#id9), therefore, we should be able to simply install the vanilla package via pip. This will also allow us to easily move to Python 3, as Python 3 support was added in version 4.0.0.
**- How I did it**
- Remove Makefiles and patches for building supervisor package from source
- Install Python 3 supervisor package version 4.2.1 in Buster base container
- Also install Python 3 version of supervisord-dependent-startup in Buster base container
- Debian package installed binary in `/usr/bin/`, but pip package installs in `/usr/local/bin/`, so rather than update all absolute paths, I changed all references to simply call `supervisord` and let the system PATH find the executable to prevent future need for changes just in case we ever need to switch back to build a Debian package, then we won't need to modify these again.
- Install Python 2 supervisor package >= 3.4.0 in Stretch and Jessie base containers
Why/How I did:
Make sure first error syslog is triggered based on FAULT TOLERANCE condition.
Added support of repeat clause with alert action. This is used as trigger
for generation of periodic syslog error messages if error is persistent
Updated the monit conf files with repeat every x cycles for the alert action
We want to let Monit to unmonitor the processes in containers which are disabled in `FEATURE` table such that
Monit will not generate false alerting messages into the syslog.
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
When stopping the swss, pmon or bgp containers, log messages like the following can be seen:
```
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,061 ERRO pool dependent-startup event buffer overflowed, discarding event 34
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,063 ERRO pool dependent-startup event buffer overflowed, discarding event 35
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,064 ERRO pool dependent-startup event buffer overflowed, discarding event 36
Aug 23 22:50:43.789760 sonic-dut INFO swss#supervisord 2020-08-23 22:50:10,066 ERRO pool dependent-startup event buffer overflowed, discarding event 37
```
This is due to the number of programs in the container managed by supervisor, all generating events at the same time. The default event queue buffer size in supervisor is 10. This patch increases that value in all containers in order to eliminate these errors. As more programs are added to the containers, we may need to further adjust these values. I increased all buffer sizes to 25 except for containers with more programs or templated supervisor.conf files which allow for a variable number of programs. In these cases I increased the buffer size to 50. One final exception is the swss container, where the buffer fills up to ~50, so I increased this buffer to 100.
Resolves https://github.com/Azure/sonic-buildimage/issues/5241
Calls to sonic-cfggen is CPU expensive. This PR reduces calls to
sonic-cfggen to once calle during snmp startup
singed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
**- Why I did it**
Initially, the critical_processes file contains either the name of critical process or the name of group.
For example, the critical_processes file in the dhcp_relay container contains a single group name
`isc-dhcp-relay`. When testing the autorestart feature of each container, we need get all the critical
processes and test whether a container can be restarted correctly if one of its critical processes is
killed. However, it will be difficult to differentiate whether the names in the critical_processes file are
the critical processes or group names. At the same time, changing the syntax in this file will separate the individual process from the groups and also makes it clear to the user.
Right now the critical_processes file contains two different kind of entries. One is "program:xxx" which indicates a critical process. Another is "group:xxx" which indicates a group of critical processes
managed by supervisord using the name "xxx". At the same time, I also updated the logic to
parse the file critical_processes in supervisor-proc-event-listener script.
**- How to verify it**
We can first enable the autorestart feature of a specified container for example `dhcp_relay` by running the comman `sudo config container feature autorestart dhcp_relay enabled` on DUT. Then we can select a critical process from the command `docker top dhcp_relay` and use the command `sudo kill -SIGKILL <pid>` to kill that critical process. Final step is to check whether the container is restarted correctly or not.
The `-sv2` suffix was used to differentiate SNMP Dockers when we transitioned from "SONiCv1" to "SONiCv2", about four years ago. The old Docker materials were removed long ago; there is no need to keep this suffix. Removing it aligns the name with all the other Dockers.
Also edit Monit configuration to detect proper snmp-subagent command line in Buster, and make snmpd command line matching more robust.
* Adding support for V2 in SNMP/LLDP (-sv2 postfix)
* Fixes for V1 containers: logging
* Fixes for V1 LLDP: limit LLDP to Front-panel or MGMT interfaces.