From 5c7c789b4ac750ac4f32d041201eefe28148321f Mon Sep 17 00:00:00 2001 From: Saikrishna Arcot Date: Tue, 29 Nov 2022 16:48:23 -0800 Subject: [PATCH] Fix error handling when failing to install a deb package (#11846) (#12857) Signed-off-by: Saikrishna Arcot sarcot@microsoft.com Why I did it The current error handling code for when a deb package fails to be installed currently has a chain of commands linked together by && and ends with exit 1. The assumption is that the commands would succeed, and the last exit 1 would end it with a non-zero return code, thus fully failing the target and causing the build to stop because of bash's -e flag. However, if one of the commands prior to exit 1 returns a non-zero return code, then bash won't actually treat it as a terminating error. From bash's man page: -e Exit immediately if a pipeline (which may consist of a single simple command), a list, or a compound command (see SHELL GRAMMAR above), exits with a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test following the if or elif reserved words, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command's return value is being inverted with !. If a compound command other than a subshell returns a non-zero status because a command failed while -e was being ignored, the shell does not exit. The part part of any command executed in a && or || list except the command following the final && or || says that if the failing command is not the exit 1 that we have at the end, then bash doesn't treat it as an error and exit immediately. Additionally, since this is a compound command, but isn't in a subshell (subshell are marked by ( and ), whereas { and } just tells bash to run the commands in the current environment), bash doesn't exist. The result of this is that in the deb-install target, if a package installation fails, it may be infinitely stuck in that while-loop. This was seen when the snmpd package upgrade happened, and builds were failing to install the mismatching libsnmp-dev package, the builds did not immediately terminate; instead, the installation was retried again and again, suggesting it was stuck in some infinite loop. The build jobs finally terminated only because of the timeout specified for the jobs. How I did it There are two fixes for this: change to using a subshell, or use ; instead of &&. Using a subshell would, I think, require exporting any shell variables used in the subshell, so I chose to change the && to ;. In addition, at the start of the subshell, set +e is added in, which removes the exit-on-error handling of bash. This makes sure that all commands are run (the output of which may help for debugging) and that it still exits with 1, which will then fully fail the target. How to verify it --- slave.mk | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slave.mk b/slave.mk index 9923c6f3a2..15fadcffca 100644 --- a/slave.mk +++ b/slave.mk @@ -683,7 +683,7 @@ $(SONIC_INSTALL_DEBS) : $(DEBS_PATH)/%-install : .platform $$(addsuffix -install { while dpkg -s $(firstword $(subst _, ,$(basename $(deb)))) | grep "^Version: $(word 2, $(subst _, ,$(basename $(deb))))" &> /dev/null; do echo "waiting for $(deb) to be uninstalled" $(LOG); sleep 1; done } ) # put a lock here because dpkg does not allow installing packages in parallel if mkdir $(DEBS_PATH)/dpkg_lock &> /dev/null; then - { sudo DEBIAN_FRONTEND=noninteractive dpkg -i $(DEBS_PATH)/$* $(LOG) && rm -d $(DEBS_PATH)/dpkg_lock && break; } || { rm -d $(DEBS_PATH)/dpkg_lock && sudo lsof /var/lib/dpkg/lock-frontend && ps aux && exit 1 ; } + { sudo DEBIAN_FRONTEND=noninteractive dpkg -i $(DEBS_PATH)/$* $(LOG) && rm -d $(DEBS_PATH)/dpkg_lock && break; } || { set +e; rm -d $(DEBS_PATH)/dpkg_lock; sudo lsof /var/lib/dpkg/lock-frontend; ps aux; exit 1 ; } fi sleep 10 done