sonic-buildimage/files/initramfs-tools/union-mount.j2
Samuel Angebault 0c4d4ace76
[kdump] Fix OOM events in crashkernel (#6447)
A few issues where discovered with crashkernel on Arista platforms.

1) platforms using `docker_inram=on` would end up OOM in kdump environment.
This happens because the same initramfs is used by SONiC and the crashkernel.
With `docker_inram=on` the `dockerfs.tar.gz` is extracted in a `tmpfs` created for the occasion.
Since `dockerfs.tar.gz` weights more than 1.5G, it doesn't fit into the kdump environment and ends up OOM.
This OOM event can in turn trigger a panic.

2) Arista platforms with `secureboot` enabled would fail to load the crashkernel because the kernel parameter would be discarded on boot.
This happens because the `boot0` in secureboot mode is strict about kernel parameter injection.

3) The secureboot path allowlist would remove kernel crash reports.

4) The kdump service would fail on Arista products since `/boot/` is empty in `secureboot`

**- How I did it**

1) To prevent an OOM event in the crashkernel the fix is to avoid the codepaths in `union-mount` that create tmpfs and populate them. Some more codepath specific to Arista devices are also skipped to make the kdump process faster.
This relies on detecting that the initramfs is starting in a kdump environment and skipping some initialization.
The `/usr/sbin/kdump-config` tool appends a few kernel cmdline arguments when loading the crashkernel.
The most unique one is `systemd.unit=kdump-tools.service` which is used in a few initramfs hooks to set `in_kdump`.

2) To allow `kdump` to work in `secureboot` environment the cmdline generation in boot0 was slightly modified.
The codepath to load kernel parameters changed by SONiC is now running for booting in secure mode.
It was altered to prevent an append only behavior which would grow the `kernel-cmdline` at every reboot.
This ever growing behavior would lead `kexec` to fail to load the kernel due to a too long cmdline.

3) To get the kernel crash under /var/crash this path has to be added to `allowlist_paths`

4) The `/host/image-XXX/boot` folder is now populated in `secureboot` mode but not used.

**- How to verify it**

Regular boot:
 - enable kdump
 - enable docker_inram=on via kernel-params
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness OOM events on the console
 - after: crash kernel works and crash available under /var/crash

Secure boot:
 - enable kdump
 - reboot
 - generate a crash `echo c > /proc/sysrq-trigger`
 - before: witness no kdump
 - after: crash kernel works and crash available under /var/crash


Co-authored-by: Boyang Yu <byu@arista.com>
2021-02-02 01:55:09 -08:00

170 lines
5.7 KiB
Django/Jinja

#!/bin/sh -e
PREREQS="varlog"
prereqs() { echo "$PREREQS"; }
case $1 in
prereqs)
prereqs
exit 0
;;
esac
docker_inram=false
logs_inram=false
secureboot=false
bootloader=generic
in_kdump=false
# Extract kernel parameters
for x in $(cat /proc/cmdline); do
case "$x" in
Aboot=*)
bootloader=aboot
;;
docker_inram=on)
docker_inram=true
;;
logs_inram=on)
logs_inram=true
;;
secure_boot_enable=[y1])
secureboot=true
docker_inram=true
;;
platform=*)
platform_flag="${x#platform=}"
;;
systemd.unit=kdump-tools.service)
in_kdump=true
;;
esac
done
set_tmpfs_log_partition_size()
{
varlogsize=128
# set varlogsize to existing var-log.ext4 size
if [ -f ${rootmnt}/host/disk-img/var-log.ext4 ]; then
varlogsize=$(ls -l ${rootmnt}/host/disk-img/var-log.ext4 | awk '{print $5}')
varlogsize=$(($varlogsize/1024/1024))
fi
# make sure varlogsize is between 5% to 10% of total memory size
memkb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
memmb=$(($memkb/1024))
minsize=$(($memmb*5/100))
maxsize=$(($memmb*10/100))
[ $minsize -ge $varlogsize ] && varlogsize=$minsize
[ $maxsize -le $varlogsize ] && varlogsize=$maxsize
}
remove_not_in_allowlist_files()
{
local allowlist_file="$1"
local targeted_dir="$2"
local allowlist_pattern_file=/tmp/allowlist_paths.pattern
# Return if the allowlist file does not exist
if ! test -f "${allowlist_file}"; then
echo "The file ${allowlist_file} is missing, failed to mount rw folder." 1>&2
exit 1
fi
# Set the grep pattern file, remove the blank line in config file
awk -v rw_dir="$targeted_dir" 'NF {print rw_dir"/"$0"$"}' ${allowlist_file} > $allowlist_pattern_file
# Find the files in the rw folder, and remove the files not in the allowlist
find ${targeted_dir} -type f | grep -v -f $allowlist_pattern_file | xargs /bin/rm -f
rm -f $allowlist_pattern_file
}
## Mount the overlay file system: rw layer over squashfs
image_dir=$(cat /proc/cmdline | sed -e 's/.*loop=\(\S*\)\/.*/\1/')
rw_dir=${rootmnt}/host/$image_dir/rw
work_dir=${rootmnt}/host/$image_dir/work
mkdir -p "$rw_dir"
mkdir -p "$work_dir"
## Remove the files not in allowlist in the rw folder
if [ "$secureboot" = true ] && [ "$in_kdump" = false ]; then
if [ "$bootloader" = "aboot" ]; then
swi_path="${rootmnt}/host/$(sed -E 's/.*loop=([^ ]+).*/\1/' /proc/cmdline)"
unzip -q "$swi_path" allowlist_paths.conf -d /tmp
allowlist_file=/tmp/allowlist_paths.conf
else
allowlist_file=${rootmnt}/host/$image_dir/allowlist_paths.conf
fi
remove_not_in_allowlist_files "$allowlist_file" "$rw_dir"
## Remove the executable permission for all the files in rw folder except home folder
find ${rw_dir} -type f -not -path ${rw_dir}/home -exec chmod a-x {} +
fi
mount -n -o lowerdir=${rootmnt},upperdir=${rw_dir},workdir=${work_dir} -t overlay root-overlay ${rootmnt}
## Check if the root block device is still there
[ -b ${ROOT} ] || mdev -s
case "${ROOT}" in
ubi*)
mtd=$(cat /proc/cmdline | sed -e 's/.*ubi.mtd=\([0-9]\) .*/\1/')
if [ ! -f /dev/${ROOT}_0 ]; then
ubiattach /dev/ubi_ctrl -m $mtd 2>dev/null || true
fi
mount -t ubifs /dev/${ROOT}_0 ${rootmnt}/host
;;
*)
## Mount the raw partition again
mount ${ROOT} ${rootmnt}/host
;;
esac
mkdir -p ${rootmnt}/var/lib/docker
if [ "$in_kdump" = false ]; then
if [ "$secureboot" = true ]; then
mount -t tmpfs -o rw,nodev,size={{ DOCKER_RAMFS_SIZE }} tmpfs ${rootmnt}/var/lib/docker
if [ "$bootloader" = "aboot" ]; then
unzip -qp "$swi_path" dockerfs.tar.gz | tar xz --numeric-owner -C ${rootmnt}/var/lib/docker
## Boot folder is not extracted during secureboot since content would inherently become unsafe
mkdir -p ${rootmnt}/host/$image_dir/boot
else
echo "secureboot unsupported for bootloader $bootloader" 1>&2
exit 1
fi
elif [ -f ${rootmnt}/host/$image_dir/{{ FILESYSTEM_DOCKERFS }} ]; then
## mount tmpfs and extract docker into it
mount -t tmpfs -o rw,nodev,size={{ DOCKER_RAMFS_SIZE }} tmpfs ${rootmnt}/var/lib/docker
tar xz --numeric-owner -f ${rootmnt}/host/$image_dir/{{ FILESYSTEM_DOCKERFS }} -C ${rootmnt}/var/lib/docker
else
## Mount the working directory of docker engine in the raw partition, bypass the overlay
mount --bind ${rootmnt}/host/$image_dir/{{ DOCKERFS_DIR }} ${rootmnt}/var/lib/docker
fi
fi
## Mount the boot directory in the raw partition, bypass the overlay
mkdir -p ${rootmnt}/boot
mount --bind ${rootmnt}/host/$image_dir/boot ${rootmnt}/boot
## Mount loop device or tmpfs for /var/log
if $logs_inram; then
# NOTE: some platforms, when reaching initramfs stage, have a small
# limit of mounting tmpfs partition, potentially due to amount
# of RAM available in this stage. e.g. Arista 7050-qx32[s] and 7060-cx32s
set_tmpfs_log_partition_size
mount -t tmpfs -o rw,nosuid,nodev,size=${varlogsize}M tmpfs ${rootmnt}/var/log
[ -f ${rootmnt}/host/disk-img/var-log.ext4 ] && rm -rf ${rootmnt}/host/disk-img/var-log.ext4
else
[ -f ${rootmnt}/host/disk-img/var-log.ext4 ] && fsck.ext4 -v -p ${rootmnt}/host/disk-img/var-log.ext4 2>&1 \
| gzip -c >> /tmp/fsck.log.gz
[ -f ${rootmnt}/host/disk-img/var-log.ext4 ] && mount -t ext4 -o loop,rw ${rootmnt}/host/disk-img/var-log.ext4 ${rootmnt}/var/log
fi
## fscklog file: /tmp will be lost when overlayfs is mounted
if [ -f /tmp/fsck.log.gz ]; then
mv /tmp/fsck.log.gz ${rootmnt}/var/log
fi