Fix return code when backup to remote RGW fails

In the database backup framework (_backup_main.sh.tpl), the
backup_databases function exits with code 1 if the
store_backup_remotely function fails to send the backup to the remote
RGW. This causes the pod to fail and be restarted by the cronjob, over
and over, until the backoff retries limit (6 by default) is reached.
Each retry creates another copy of the same backup on the file system.
In addition, the default k8s behavior is to delete the job/pods once
the backoff limit has been exceeded, which makes troubleshooting more
difficult (although the logs may still be available in Elasticsearch).

This patch changes the return code to 0 so that the pod will not fail
in that scenario. The error logs generated should be enough to flag
the failure (via Nagios or whatever alerting system is being used).
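
The intended control flow can be sketched as follows. This is only an
illustration, not the code in _backup_main.sh.tpl: log_error,
REMOTE_OK, and backup_databases_sketch are hypothetical names, the
store_backup_remotely body is a stand-in that simulates the upload, and
the sketch uses return instead of exit so it can be sourced and tried
out safely.

```shell
#!/bin/bash
# Hypothetical helper: write an ERROR line to stderr for the alerting system.
log_error() {
  echo "ERROR: $*" >&2
}

# Stand-in for the real upload function (assumption): it succeeds only
# when REMOTE_OK is set to "true".
store_backup_remotely() {
  [ "${REMOTE_OK:-false}" = "true" ]
}

# Returns 1 only when the local archive is missing. A failed remote
# upload is logged as an ERROR but still returns 0, so the pod would not
# be retried and would not write further copies of the same backup.
backup_databases_sketch() {
  local archive="$1"
  if [ ! -f "$archive" ]; then
    log_error "local backup $archive is missing"
    return 1
  fi
  if ! store_backup_remotely "$archive"; then
    log_error "could not send $archive to the remote RGW; local copy kept"
    return 0
  fi
  echo "backup $archive stored remotely"
  return 0
}
```

With REMOTE_OK unset, calling backup_databases_sketch on an existing
file prints an ERROR line to stderr and still returns 0, which mirrors
the behavior this patch introduces.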

Change-Id: Ie1c3a7aef290bf6de4752798821d96451c1f2fa5
Cliff Parsons 2020-06-30 16:22:08 +00:00
parent b1e66fd308
commit 1508324ce7

@@ -346,7 +346,10 @@ backup_databases() {
echo "Backup archive size: $ARCHIVE_SIZE"
echo "=================================================================="
set -x
-exit 1
+# Because the local backup was successful, exit with 0 so the pod will not
+# continue to restart and fill the disk with more backups. The ERRORs are
+# logged and alerting system should catch those errors and flag the operator.
+exit 0
fi
#Only delete the old archive after a successful archive