Fix monit false alarm issue, which located in process_checker and it missed "disk-sleep" status check, thus some 201911 SONiC box report "pmon|sensord" error coincidently.
#### Why I did it
Currently psutil library returns below detail process status:
running: The process is currently running.
sleeping: The process is sleeping or waiting for an event to occur.
disk-sleep: The process is waiting for I/O operations to complete.
stopped: The process has been stopped (e.g. via the SIGSTOP signal).
zombie: The process has terminated but is still listed in the process table.
dead: The process has terminated and has been removed from the process table.
We should regard running/sleeping/disk-sleep as normal case and not alert in monit process.
Now once the disk-sleep occurs during monit cycle, below syslog will be paged, so get rid of syslog output meanwhile.
yslog.2.gz:Feb 24 06:12:17.394619 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:13:17.932531 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
syslog.2.gz:Feb 24 06:14:18.502505 MEL23-0101-0301-04T1 ERR monit[6040]: 'pmon|sensord' status failed (1) -- '/usr/sbin/sensord -f daemon' is not running in host
Then I tried to reproduce the issue by triggering process_checker for sensord frequently and observed it's under "disk-sleep" status once the alert is raised.
##### Work item tracking
- Microsoft ADO **(number only)**:17663589
#### How I did it
Fix process_checker script code for adding "disk-sleep" case handling.
#### How to verify it
Verified in local DUT.
Why I did it
Back port #6478 and #6519 to 201911 branch.
Work item tracking
Microsoft ADO (number only):
24978836
How I did it
Add checking the connection between zebra and bgp during bgpd start.
How to verify it
Modify start.h, add debug log and check the syslog
_Sep 22 02:41:29.716356 str-a7060cx-acs-10 INFO bgp#root: ####: start zebra
Sep 22 02:41:30.815341 str-a7060cx-acs-10 INFO bgp#root: ####: start check connection
Sep 22 02:41:30.868784 str-a7060cx-acs-10 INFO bgp#root: ####: It took 0.029979 seconds to wait for zebra to be ready to accept connections
Sep 22 02:41:30.873685 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd
Sep 22 02:41:35.270569 str-a7060cx-acs-10 INFO bgp#root: ####: done_
_Sep 22 03:28:02.423438 str-a7060cx-acs-10 INFO bgp#root: ####: start zebra
Sep 22 03:28:03.731320 str-a7060cx-acs-10 INFO bgp#root: ####: start check connection
Sep 22 03:28:33.749152 str-a7060cx-acs-10 INFO bgp#root: ####: Error: zebra is not ready to accept connections
Sep 22 03:28:33.752490 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd
Sep 22 03:28:34.259735 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpd done
Sep 22 03:28:34.755538 str-a7060cx-acs-10 INFO bgp#root: ####: start bgpcfgd
Sep 22 03:28:35.800906 str-a7060cx-acs-10 INFO bgp#root: ####: done_
ef2a0cd0 [201911] [multi_asic] Script to monitor errors on internal links (#2971)
1252e31b Changes to separate UT data for internal link monitor (#2976)
3e6654e [[201911] [multi-asic] Unit test fix for internal link monitoring (#2977)
Why I did it
Fix: #16086
faketime package url expired. It breaks 201911 build.
Update package url.
Work item tracking
Microsoft ADO (number only): 24930879
This PR makes two changes:
- Store Jinja2 cache in LOGLEVEL DB instead of STATE DB
- Store bytecode cache encoded in base64
Tested with the following command: "redis-dump -d 3 -k JINJA2_CACHE"
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
202211 and above uses different squashfs compression type that 201911 kernel can not handle. Therefore, we avoid mounting squashfs altogather with this change.
Why I did it
Some products might experience an occasional IO failure in the communication between CPU and SSD.
Based on some research it could be attributable to some device not handling ATA NCQ (Native Command Queue).
This issue currently affect 4 products:
DCS-7170-32C*
DCS-7170-64C
DCS-7060DX4-32
DCS-7260CX3-64
DCS-7050CX3-32S
How I did it
This change disable NCQ on the affected drive for a small set of products.
How to verify it
When the fix is applied, these 2 patterns can be found in the dmesg.
ata[0-9]+.00: FORCE: horkage modified (noncq)
NCQ (not used)
Test results using: fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
with NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (depth 32), AA)
READ: bw=33.9MiB/s (35.6MB/s), 33.9MiB/s-33.9MiB/s (35.6MB/s-35.6MB/s), io=4073MiB (4270MB), run=120078-120078msec
WRITE: bw=34.1MiB/s (35.8MB/s), 34.1MiB/s-34.1MiB/s (35.8MB/s-35.8MB/s), io=4100MiB (4300MB), run=120078-120078msec
without NCQ (ata1.00: 61865984 sectors, multi 1: LBA48 NCQ (not used))
READ: bw=31.7MiB/s (33.3MB/s), 31.7MiB/s-31.7MiB/s (33.3MB/s-33.3MB/s), io=3808MiB (3993MB), run=120083-120083msec
WRITE: bw=31.9MiB/s (33.4MB/s), 31.9MiB/s-31.9MiB/s (33.4MB/s-33.4MB/s), io=3830MiB (4016MB), run=120083-120083msec
Which release branch to backport (provide reason below if selected)
Change to use the snapshot mirror http://packages.trafficmanager.net/snapshot.
Warning: The Jessie distribution is EOL, please avoid to use it if you can. And the snapshot mirror will be removed in near future as well.
Why I did it
docker.com's gpg key start to work from 2023-02-23. While debian.org's gpg key expired in 2022-11.
We used a walkaround for security checking for debian gpg keys. Now we need to exclude docker.com's gpg key.
How I did it
Update docker.com's gpg key without faketime.
Update others' gpg key with faketime '2022-11'
How to verify it
#### Why I did it
To allow SSH connections from IPv6 addresses
Resolves https://github.com/Azure/sonic-buildimage/issues/7668
#### How I did it
In build_debian.sh, modify sshd_config file so as to enable listening for IPv6 connections
- Why I did it
Added BIOS upgrade infra
- How I did it
Added new make target
- How to verify it
Copy msn3800_bios.tar.gz to platform/mellanox/bios
make configure PLATFORM=mellanox
make target/files/stretch/msn3800_bios.tar.gz
Signed-off-by: Nazarii Hnydyn <nazariig@nvidia.com>
#### Why I did it
The GPG key used for Jessie's official repos has since expired, which means building 201911 images no longer works.
#### How I did it
Fake the time to be before the expiry date.
* Fix to improve hostname handling
If config_db.json is missing hostname entry, hostname-config.sh ends
up deleting existing entry too and hostname changes to default 'localhost'
* default hostname to 'sonic` if missing in config file
Upgrade systemd to fix timer elapsed issue.
#### Why I did it
On 201911 release, snmp.timer become elapsed status and snmp.service will not be trigger by snmp.timer:
● snmp.service - SNMP container
Loaded: loaded (/usr/lib/systemd/system/snmp.service; static; vendor preset: enabled)
Active: inactive (dead)
● snmp.timer - Delays snmp container until SONiC has started
Loaded: loaded (/usr/lib/systemd/system/snmp.timer; enabled; vendor preset: enabled)
Active: active (elapsed) since Wed 2022-08-03 18:12:59 UTC; 2 months 17 days ago
This issue caused by systemd bug: https://github.com/systemd/systemd/pull/10778/files
This issue can be reproduce with following steps:
1. reboot system.
2. continusly run following commands till timer elapsed:
systemctl status snmp.timer
sudo systemctl daemon-reload
#### How I did it
Install latest version systemd from offical backport source.
#### How to verify it
Pass all test case.
Manually check reproduce steps, verify the issue fixed.
#### Which release branch to backport (provide reason below if selected)
<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
#### Description for the changelog
Upgrade systemd to fix timer elapsed issue.
#### Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
#### A picture of a cute animal (not mandatory but encouraged)
Certain platform specific packages sonic-platform-xyz, installs files onto rootfs, which would be placed on read-write mount path on /host/image-name/rw/...
when ntpd starts it tries to do read access on /usr/bin /usr/sbin/ /usr/local/bin , which inturn links further to the read-write mount path also.
Where ntpd would get below Apparmor Warning message
LOG:-
audit: type=1400 audit(1606226503.240:21): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/local/bin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
audit: type=1400 audit(1606226503.240:22): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/sbin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
audit: type=1400 audit(1606226503.240:23): apparmor="DENIED" operation="open" profile="/usr/sbin/ntpd" name="/image-HEAD-dirty-20201111.173951/rw/usr/bin/" pid=3733 comm="ntpd" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Fix:
Add rw/.. mount path similar to root path access provided for ntpd in /etc/apparmor.d/usr.sbin.ntpd
Signed-off-by: Antony Rheneus <arheneus@marvell.com>
Why I did it
Change the path of sonic submodules that point to "Azure" to point to "sonic-net"
How I did it
Replace "Azure" with "sonic-net" on all relevant paths of sonic submodules
Porting https://github.com/sonic-net/sonic-buildimage/pull/3723 to 201911
#### Why I did it
Extend Mellanox FW utils with CPLD update feature
Added support for CPLD upgrade to Mellanox FW utility
#### How I did it
Updated Mellanox FW utility
#### How to verify it
mlnx-fw-upgrade.sh --upgrade --cpld # Regular CPLD update flow
UPDATE_MLNX_CPLD_FW=1 mlnx-fw-upgrade.sh --upgrade # Force CPLD refresh only
#### Ensure to add label/tag for the feature raised. example - [PR#2174](https://github.com/sonic-net/sonic-utilities/pull/2174) where, Generic Config and Update feature has been labelled as GCU.
Cherry-pick the commit from master where in multi-asic platforms bgp template rendering fails which needs Loopback4096 IP Address. Issue happens because of timing/race condition where if peer gets added first and then Loopback4096 notification comes to bgpcfgd
- Why I did it
Collecting MST dump before syncd restart on shutdown notification during a SAI failure
Dump can be found under:
root@sonic:/home/admin# ls -l /var/dump/mstdump/
total 10684
-rw-r--r-- 1 root root 5460332 Aug 15 18:41 mstdump_20220815_184143.tar.gz
-rw-r--r-- 1 root root 5473253 Aug 15 21:46 mstdump_20220815_214642.tar.gz
root@sonic:/home/admin# tar -xvzf /var/dump/mstdump/mstdump_20220815_214642.tar.gz
├── ir-gdb
│ └── core
└── mstdump
├── mstdump1
├── mstdump2
├── mstdump3
└── mststatus
- How I did it
Checked for shutdown notification log in sairedis and used it to determine whether the shutdown is normal or due to SAI failure
- How to verify it
Simulated a SAI failure event and verified it. Verified it also on different reboots and config reload scenarios the dump is not generated
Signed-off-by: Vivek Reddy <vkarri@nvidia.com>
Why I did it
Add the hardware reboot cause when the previous software reboot failed
How I did it
Check both hardware reboot cause and software reboot cause.
Add the hardware reboot as actual reboot cause
if any hardware reboot cause is available for any software reboot.
How to verify it
Perform reboots and verify the reboot-cause
Why I did it
Add Celestica Silverstone-x platform
How I did it
Add Celestica Silverstone-x platform
How to verify it
verified by SONiC tested platform APIs
verified by SONiC APIs including " psuutil
psushow(show platform psustatus)
sfputil
sfpshow
tempershow(show platform temperature)
fanshow(show platform fan)
watchdogutil
fwutil(show platform firmware status)
decode-syseeprom -d(show platform syseeprom)
show platform ssdhealth
show platform summary
show interfaces status
"