A few issues where discovered with crashkernel on Arista platforms.
1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.
2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.
3) The secureboot path allowlist would remove kernel crash reports.
4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`
**- How I did it**
1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.
2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.
3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`
4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.
**- How to verify it**
Regular boot:
- enable kdump
- enable docker_inram=on via kernel-params
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness OOM events on the console
- after: crash kernel works and crash available under /var/crash
Secure boot:
- enable kdump
- reboot
- generate a crash `echo c > /proc/sysrq-trigger`
- before: witness no kdump
- after: crash kernel works and crash available under /var/crash
Co-authored-by: Boyang Yu <byu@arista.com>
The first partition starting point was changed to be 1M as part of this
commit: 6ba2f97f1e. On systems that are misaligned before conversion
(partition start is the first sector), the relica partition that is
left in the first MB can cause problems in Aboot and result in corruption
of the filesystem on the new aligned partition.
Zeroing this old relica makes sure that there is nothing left of the old
partition lying around. There won't be any risk of having Aboot corrupt
the new filesystem because of the old relica.
Signed-off-by: Baptiste Covolato <baptiste@arista.com>
Fix the issue of incorrectly skipping the convertfs hook when fast-reboot from EOS, by adding an extra kernel cmdline param "prev_os" to differentiate fast-reboot from EOS and from SONiC.
This is because we still do disk conversion for fast reboot from eos to sonic, like format the disk.
when device disk is small, do not unzip dockerfs.tar.gz on disk.
keep the tar file on the disk, unzip to tmpfs in the initrd phase.
enabled this for 7050-qx32
Signed-off-by: Guohan Lu <gulv@microsoft.com>
* Update arista drivers submodule
* Ignore the possible timestamp warning in tar extraction
* Add verbosity toggle to boot0
Console logging is slow because of the 9600 baud rate.
Some time can be saved by decreasing the console verbosity.
* Add hook mechanism in boot0.
Support additional features in boot0 via hooks.
Hooks are unpacked and executed at post-install or pre-exec time.
* Fix 7170 sensors.conf file
Fix critical temperature settings for MAX6658 sensors
* Fix the random swap of storage devices
For arista 7050 switches running with linux 4.9, it is likely the device
name of flash drive (/dev/sda) and usb (/dev/sdb) randomly swap in kernel
booting, depending on which one is ready first. It breaks the expectation
that flash will be mounted as root by setting root=/dev/sda1. This patch
will correct ROOT to flash device refering to the path under block_flash.
* Fix 7170 fancontrol
* Do not remove aquota.user file in boot0
This file is a filesystem protected file used by EOS.
It can be simply removed and will make the SONiC installation failed if
not skipped.
Flashes used for the 7050QX-32 and 7050QX-32S have a fw issue.
The best option to solve the problem is to upgrade to a newer firmware.
However this can only be done while in memory and take 10 seconds.
Adding an upgrade mechanism is possible but would need more
consideration as flashing the firmware and reformating the flash will
exceed the fast-reboot requirements.
A quick mitigation is to align the ext4 partition that we create on
these vfat based system on a 4k boundary.
Here we chose 1M instead but it's the same.
Newer version of sfdisk do this automatically but the one in SONiC
today doesn't have this behavior.
This workaround will only reduce the pace of the flash health
degradation. The only long term fix is to flash the firmware.
* Update sensors.conf for 7050QX-32 and 7050QX-32S
These two platforms were using a previous version of a kernel driver.
The new one names the i2c buses differently.
We therefore need to rename them here.
* Fix the default minigraph for the 7050QX-32S
The interface offset is invalid which makes sonic-cfggen generate an
invalid config_db.jon in rc.local.
This config then silently makes orchagent/syncd fail.
* Use the partition on which sonic-aboot.swi is
Instead of always assuming /mnt/flash, use the partition where the image
to be installed lies.
This allow for the image to be on any partition.
* Bump sonic-platform-modules-arista
Improves i2c performance for xcvrs
Fix the led_plugin by ignoring unknown ports
Miscellaneous improvements
* Fix index column for Arista-7260CX3-D108C8
* Fix flash permissions for Arista platforms
The ext4 flash uses acl to properly handle permissions in EOS.
Aboot isn't built with this support and therefore can't be used
to set the flash permissions. It has to be deferred in sonic initrd.
moving to initramfs unifies disk allocate on different platforms.
use fallocate instead of dd to speed up the disk allocation.
By default, mkfs.ext4 has -E discard option which discards the blocks
at the mkfs time, also speed up the initialization time.
* Bump sonic-platform-modules-arista submodule
* Use sonic_sfputil plugin from the arista library
* Fix undefined variable varlog_size
* Prevent minigraph.xml to be removed from the flash
* Update DCS-7050QX-32 sensors config