#### Why I did it
There might be a case where service checker periodic operation determined that specific container is running but when it tries to perform an operation on it, it was already closed by the user. This is a valid flow and we should not log an error message, informative warning is enough.
#### How I did it
I reduce log severity.
#### How to verify it
I verified it manually.
Why I did it
The t0-sonic pool has been fixed, so add it back to azp checker.
How I did it
Remove continueOnError in run-test-template.yml.
Signed-off-by: Ze Gan <ganze718@gmail.com>
#### Why I did it
SSHD keepalive timeout feature not enabled on sonic.
#### How I did it
Enable SSHD keepalive timeout feature by set ClientAliveCountMax to 1.
#### How to verify it
Pass All E2E test case.
Manually test with following steps:
1. Change config and restart sshd
2. Connect a ssh with -vvv option to show debug message
3. Get running ssh by command and stop it:
```
azureuser@liuh-dev-vm-02:~$ ps -auxww | grep vvv
azureus+ 1614153 0.0 0.0 12244 6004 pts/1S+ 15:48 0:00 ssh admin@10.250.0.101 -vvv
azureus+ 1615570 0.0 0.0 8168 2424 pts/3S+ 15:49 0:00 grep --color=auto vvv
azureuser@liuh-dev-vm-02:~$ kill -Stop 1614153
```
4. Check TCP status from server side with ss command:
https://man7.org/linux/man-pages/man8/ss.8.html
```
admin@vlab-01:~$ ss | grep -i ssh
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:58150
tcp FIN-WAIT-2 0 010.250.0.101:ssh 10.250.0.1:58164
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:57978
```
FIN-WAIT-2 means server already terminate the connection and wait for client response:
https://kb.iu.edu/d/ajmi
. FIN-WAIT-2 <-- <SEQ=300><ACK=101><CTL=ACK> <-- CLOSE-WAIT
5. Check again later will show the session been complete closed:
```
admin@vlab-01:~$ ss | grep -i ssh
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:58150
tcp ESTAB 0 010.250.0.101:ssh 10.250.0.1:57978
```
- Why I did it
When LLDP is disabled through feature command, it gets spawned after reboot.
- How I did it
In syncd.sh check if the service is enabled before spawning automatically during cold reboot.
- How to verify it
Disable lldp feature. Perform cold reboot and verify its not spawned.
- Why I did it
Need to execute mlxreg inside pmon docker
- How I did it
Add MFT package to pmon Makefile
- How to verify it
Install image, go to pmon : docker exec -it pmon bash, exec mlxreg
Verifiy warm, fast and cold reboot while MFT is being called in pmon constantly
Signed-off-by: Andriy Yurkiv <ayurkiv@nvidia.com>
Signed-off-by: bingwang <bingwang@microsoft.com>
Why I did it
This PR is to add two extra lossless queues for bounced back traffic.
HLD sonic-net/SONiC#950
SKUs include
Arista-7050CX3-32S-C32
Arista-7050CX3-32S-D48C8
Arista-7260CX3-D108C8
Arista-7260CX3-C64
Arista-7260CX3-Q64
How I did it
Update the buffers.json.j2 template and buffers_config.j2 template to generate new BUFFER_QUEUE table.
For T1 devices, queue 2 and queue 6 are set as lossless queues on T0 facing ports.
For T0 devices, queue 2 and queue 6 are set as lossless queues on T1 facing ports.
Queue 7 is added as a new lossy queue as DSCP 48 is mapped to TC 7, and then mapped into Queue 7
How to verify it
Verified by UT
Verified by coping the new template and generate buffer config with sonic-cfggen
* [ci] Support to skip vstest using include/exclude config file. (#11086)
example:
├── folderA
│ ├── fileA (skip vstest)
│ ├── fileB
│ └── fileC
If we want to skip vstest when changing /folderA/fileA, and not skip vstest when changing fileB or fileC.
vstest-include:
^folderA/fileA
vstest-exclude:
^folderA
* [build] Add version files to docker image dependencies
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to fix an issue (#10088) by enhancing the script memory_checker.
Specifically, if container is not created successfully during device is booted/rebooted, then memory_checker do not need check its memory usage.
How I did it
In the script memory_checker, a function is added to get names of running containers. If the specified container name is not in current running container list, then this script will exit without checking its memory usage.
How to verify it
I tested on a lab device by following the steps:
Stops telemetry container with command sudo systemctl stop telemetry.service
Removes telemetry container with command docker rm telemetry
Checks whether the script memory_checker ran by Monit will generate the syslog message saying it will exit without checking memory usage of telemetry.
Why I did it
The docker storage driver vfs is not a good option for build, it uses the “deep copy” when building a new layer, leads to lower performance and more space used on disk than other storage drivers.
A better docker storage driver is the default one overlay2, it is a modern union filesystem.
- Why I did it
Recent change to delay PMON service in case of fast/warm reboot introduce an issue when restarting only SWSS service after fast/warm reboot for Nvidia platform.
Since the timer is triggered only when the system boot, in a scenario when the system is after a fast/warm reboot and the user restart SWSS service, as part of syncd.sh script, PMON service will stop but the timer will not start again.
- How I did it
On syncd.sh script, in case of fast/warm indication, check if pmon.timer is running.
If it is running it means we are at the first boot and continue normally.
If it is not running, meaning the service was restarted, start the timer to keep the system behavior consistent.
- How to verify it
Run fast/warm reboot.
service swss restart.
Observe PMON service starting.
Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
Why I did it
Recently the nightly testing pipeline found that the autorestart test case was failed when it was run against master image. The reason is Restart= field in each container's systemd configuration file was set to Restart=no even the value of auto_restart field in FEATURE table of CONFIG_DB is enabled.
This issue introduced by #10168 can be reproduced by the following steps:
Issues the config command to disable the auto-restart feature of a container
Runs command config reload or config reload minigraph to enable auto-restart of the container
Checks Restart= field in the container's systemd config file mentioned in step 1 by running the command
sudo systemctl cat <container_name>.service
Initially this PR (#10168) wants to revert the changes proposed by this: #8861. However, it did not fully revert all the changes.
How I did it
When hostcfgd started or was restarted, the Restart= field in each container's systemd configuration file should be initialized according to the value of auto_restart field in FEATURE table of CONFIG_DB.
How to verify it
I verified this change by running auto-restart test case against newly built master image and also ran the unittest:
The following commits are pushed
1f112b8 (HEAD -> 202205, origin/202205) [sonic-ycabled] fix grpc logic for timeout,cli HWSTATUS value retrival logic for active-active cable (#264)
Signed-off-by: vaibhav-dahiya vdahiya@microsoft.com
This fixes the build for armhf to be able to use '/device///installer.conf' files. Specifically, armhf needs support to be able to change the size of /var/log/ directory. It is hardcoded to 512 bytes on all armhf platforms currently. This change will allow any armhf platform to be able to use an installer.conf file to customize the installed image.
* [Tunnel PFC] Tests for adding property 'sai_remap_prio_on_tnl_egress'
Add tests for adding property 'sai_remap_prio_on_tnl_egress', this
property should only be added in dual tor environment.
Test done:
Run test test_j2files.py
Co-authored-by: richardyu <richardyu@contoso.com>
Why I did it
Make reset didn't clean-up all fsroot folders.
How I did it
Remove all fsroot folders used during build.
How to verify it
Run local build and local make reset:
sudo mkdir fsroot-test
sudo touch fsroot-test/foo
make reset
(Without this change, make reset cannot remove fsroot-foo, with the change, the repo become clean after make reset.)
Signed-off-by: Ying Xie ying.xie@microsoft.com
- Why I did it
1. SN2201 sai profile needs to be updated according to the latest hardware.
2. In the reboot script, need to use the common symbol link of the power_cycle sysfs instead of directly accessing it due to SN2201 sysfs is different than other platforms.
3. echo 1 > $SYSFS_PWR_CYCLE will trigger the reboot immediately, the following sleep 3 and echo 0 > $SYSFS_PWR_CYCLE will never be executed, can be removed.
- How I did it
1. Replace the SN2201 sai profile with the latest one.
2. In the platform_reboot script, replace the direct sysfs path with the symbol link path.
3. Remove the redundant code from platform_reboot
- How to verify it
Perform reboot on all the Nvidia platforms, and check all can be rebooted successfully.
Signed-off-by: Kebo Liu <kebol@nvidia.com>
- Why I did it
"import sonic_platform" takes about 600ms ~ 1000ms, it is kind of slow. After this optimization, the time is about 100ms. The benefit is that those CLIs which does not need the slow import sentence would be faster than before.
- How I did it
Find slow import and call them when need.
- How to verify it
Measure the import time.
Why I did it
Some PR test are timing out on T1-lag kvm test.
How I did it
Increase the timeout to 5 hours.
How to verify it
Test on this PR.
Signed-off-by: Ying Xie ying.xie@microsoft.com
* [Tunnel PFC] Add property for tunnel PFC
Replace the config.bcm file with j2 template file
- Add 'sai_remap_prio_on_tnl_egress=1' property when device metadata local
- Host subtype is 'dualtor'
- Change sai.profile foe the new config.bcm.j2
Including change:
* 7ff8f75 2022-06-03 | Revert "[portsorch]: Prevent LAG member configuration when port has active ACL binding (#2165)" (#2306) (HEAD -> 202205, github/202205) [bingwang-ms]
Signed-off-by: Ying Xie <ying.xie@microsoft.com>