Backport https://github.com/Azure/sonic-buildimage/pull/9068 to 202012
#### Why I did it
Command `monit summary -B` can no longer display the status for each critical process, system-health should not depend on it and need find a way to monitor the status of critical processes. The PR is to address that. monit is still used by system-health to do file system check as well as customize check.
#### How I did it
1. Get container names from FEATURE table
2. For each container, collect critical process names from file critical_processes
3. Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get processes status inside container, parse the output and check if any critical processes exit
#### How to verify it
1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
Update sonic-sairedis submodule to get the below fixes:
7389704 [202012] Add ACL_TABLE object to break before make list (Azure/sonic-sairedis#971)
f334349 Fix hung issue when installing linux kernel modules (Azure/sonic-sairedis#969)
When we update the a sai package downing from a remote server, we need to update the version file as well currently, but the reproducible build feature is not enabled in master, it can only be detected when merging the code into the release branches, such as 202106, 202012, etc.
The reproducible feature is to reduce the build failure, not need to break the build when the version not specified. If version not specified, the best choice is to accept the version from remote server.
Co-authored-by: Ubuntu <xumia@xumia-vm1.jqzc3g5pdlluxln0vevsg3s20h.xx.internal.cloudapp.net>
#### Why I did it
With current code the delay will take place even if simple 'config reload' command executed and this is not desired.
This delay should be used only when fast-rebooting.
#### How I did it
Change the type of delay to OnBootSec instead of OnActiveSec.
#### How to verify it
Fast-reboot with this PR and observe the delay.
Run 'config-reload' command and observe no delay is running.
The recent release of redis 4.0.0 or newer (for python3) breaks sonic-config-engine unit test. Fix to last known good version.
ref: https://pypi.org/project/redis/#history
[cherry-pick PR #9123 ]
Why I did it
When sshd realizes that this login can't succeed due to internal device state
or configuration, instead of failing right there, it proceeds to prompt for
password, so as the user does not get any clue on where is the failure point.
Yet to ensure that this login does not proceed, sshd replaces user provided password
with a specific pattern of characters matching length of user provided password.
This pattern is "<BS><LF><CR><DEL>INCORRECT", which is bound to fail.
If user provided length is smaller/equal, the substring of pattern is overwritten.
If user provided length is greater, the pattern is repeated until length is exhausted.
But if the PAM-tacacs plugin would send this password to AAA, the user could get
locked out by AAA, for providing incorrect value.
How I did it
Hence this fix, matches obtained password against the pattern. If match, fail just before
reaching AAA server.
How to verify it
Make sure tacacs is properly configured.
Try logging in as, say "user-A"; ensure it succeeds
Pick another user, say user-B and ensure this user has not logged into this device before (look into /etc/passed & folders under /home)
Disable monit service (as that could fix the issue using disk_check.py)
Start TCP dump for all TACACS servers.
Simulate Read-only disk
Try logging in using user-B.
Verify it fails, after 3 attempts
Stop tcp dump.
TCP dump should show "authentication" for user-A only
6f198d0 (HEAD -> 202012, origin/202012) [Y-Cable][Broadcom] upgrade to support Broadcom Y-Cable API to release (#230)
1c3e422 SSD Health: Retrieve SSD health and temperature values from generic SSD info (#229)
Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
1. CS00012211718 [4.3] Pfcwd getting continuously triggered/restored when pause frames are sent continuously to both queues of a port (TD2/Th/Th2/TD3) MSFT Default
Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
fib/test_fib.py
vxlan/test_vxlan_decap.py
fdb/test_fdb.py
decap/test_decap.py
ipfwd/test_dip_sip.py
ipfwd/test_dir_bcast.py
acl/test_acl.py
vlan/test_vlan.py
platform_tests/test_reboot.py
```
9dd3025 2021-05-11 | [Command-Reference.md] Document new SNMP show and config commands (#1600) [Travis Van Duyn]
be40767 2021-05-05 | [show][config] Add new snmp commands (#1347) [Travis Van Duyn]
- add a new service "mark_dhcp_packet" to mux container
- apply packet marks on a per-interface basis in ebtables
- write packet marks to "DHCP_PACKET_MARK" table in state_db
Fix support for DHCPV6 Relay multi vlan functionality. Make sure the relayed packet is received at correct interface.
How I did it
Bind a socket to each vlan interface's global and link-local address.
Socket binded to global address is used for relaying data from client to server and receiving data from servers.
Socket binded to link-local address is used for relaying data received from server back to the client.
This is to pick up BRCM SAI 4.3.5.1-7 fixes which contains the following fixes:
1. CS00012209390: SONIC-50037, Used SAI_SWITCH_ATTR_QOS_DSCP_TO_TC_MAP as a default decap map for IPinIP tunnels.
2. CS00012212995: SONIC-50948 SAI_API_QUEUE:_brcm_sai_cosq_stat_get:1353 egress Min limit get failed with error Invalid parameter
3. SONIC-51583: Fixed acl group member creation failure with priority of -1
4. CS00012215744:SONIC-51395 [TH, TH2] WB 3.5 to 4.3 fails at APPLY_VIEW while setting SAI_PORT_ATTR_EGRESS_ACL
5. SONIC-51638: SDK-249337 ERROR: AddressSanitizer: heap-buffer-overflow in _tlv_print_array
Preliminary tests look fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the following test cases on 7050CX3 (TD3) T0 DUT and all passed:
```
fib/test_fib.py
vxlan/test_vxlan_decap.py
fdb/test_fdb.py
decap/test_decap.py
ipfwd/test_dip_sip.py
ipfwd/test_dir_bcast.py
acl/test_acl.py
vlan/test_vlan.py
platform_tests/test_reboot.py
```
[write_standby]: Ignore non-auto interfaces
* In the event that `write_standby.py` is used to automatically switchover interfaces when linkmgrd or bgp crashes, ignore any interfaces that are not configured to auto-switch
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
[mux] Update Service Install With SONiC Target
Recent PR grouped all SONiC service into sonic.taget. The install section
of mux.service was not update and this causes delays when using config
reload as the service failed state is not being reset.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
[mux] Start Mux on Only Dual-ToR Platform
mux docker depends on the presence of mux cable hardware and is
supposed to run only Gemini ToRs. This PR change the mux feature
config in order to enable mux docker based on device configuration.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
This PR deletes local-to-buildimage linkmgrd and creates new submodule
pointing to github repo of sonic-linkmgrd.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
During warm reboot, linkmgrd would go away and so heartbeats will
be lost. This would result in standby link son peer ToR to pull the
link active. This is undesirable since we would not create tunnel
from the ToR that is being rebooted to the peer ToR. This PR
implicitly lock the state of the mux if config is not set to auto.
Also, orchagent does not initialize MUX to it hardware state, rather
it initilizes MUX to Unknown state. linkmgrd will detect this situation
and probe MUX state to correct orchagent state.
There a fix for the case when state os switched MUX is delayed. The
PR will poll the MUX for the new state. This is required to update
the state ds and hence create/tear tunnel.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Linkmgrd monitors link status, mux status, and link state. Has
the link becomes unhealthy, linkmgrd will trigger mux switchover
on a standby ToR ensuring uninterrupted service to servers/blades.
This PR is initial implementation of linkmgrd.
Also, docker-mux container hold packages related to maintaining and managing
mux cable. It currently runs linkmgrd binary that monitor and switches
the mux if needed.
This PR also introduces mux-container and starts linkmgrd as startup when
build is configured with INCLUDE_MUX=y
Edit: linkmgrd PR will follow.
signed-off-by: Tamer Ahmed <tamer.ahmed@microsoft.com>
Related work items: #2315, #3146150
The dhcp6relay rules file had a line overwriting a variable for
docker-dhcp-relay. Remove that line.
This line caused a limited impact where if some (many?) of the docker
containers were already built, except for dhcp-relay, and the build
failed or was interrupted, then dhcp-relay container would fail to build
because this variable was overwritten and the python3-swsscommon
wouldn't get installed into the slave container. Most builds would be
fine, though.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
This is to address an issue where it was observed that SAI operations sometime make take a very long to time complete (over 45ms). It was determined that the ALPM distributed thread was causing this issue.
The fix is to disable this debug thread that has no functional purpose.
Preliminary tests looks fine. BGP neighbors were all up with proper routes programmed
interfaces are all up
Manually ran the fib test cases on 7050CX3 (TD3), TD2, TH, TH2, and TH3 based platforms and
thy all passed.
Why I did it
During swss container startup, if ndppd starts up before/with vlanmgrd, ndppd will be pinned at nearly 100% CPU usage.
How I did it
Only start ndppd after vlanmgrd is running. Also, call ndppd directly instead of through bash for improved logging and to prevent orphaned processes.
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>