Shell scripting is a must-have skill for programmers. Because it is simple and easy to use, we often rely on it in daily work to automate application testing and deployment, environment cleanup, and so on. Yet writing and running shell scripts involves all sorts of pitfalls, and a moment of carelessness can keep a script from executing properly. There are just as many tricks for making shell scripts robust and reliable, so let’s explore them today.

Setting the default execution environment parameters of the Shell

When executing a shell script, a new shell is usually created. For example, when we run:

bash script.sh

Here we tell bash to create a new shell to execute script.sh, with a default set of runtime parameters for this execution environment. The set command can be used to modify the shell's runtime parameters. Run without any arguments, set displays all environment variables and shell functions. See the official manual for the full list of customizable runtime parameters; here we will focus on the four most commonly used ones.
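For example, running set -o with no option name prints every runtime option together with its current on/off state, which is a quick way to inspect the current execution environment:

set -o

# output (abbreviated):
# errexit         off
# nounset         off
# pipefail        off
# xtrace          off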

Tracking command execution

By default, a shell script only displays the results of the commands it runs; it does not show which line of code produced each result. If several commands run in a row, their results are printed in a row as well, making it difficult to tell which command produced which output. set -x makes the shell print each command before its output, with a leading + to mark it as a command rather than command output, and with every argument expanded, so we can see exactly what parameters each command ran with. This is very friendly for debugging shell scripts.

#!/bin/bash
set -x

v=5
echo $v
echo "hello"

# output:
# + v=5
# + echo 5
# 5
# + echo hello
# hello

There is another way to write set -x: set -o xtrace.
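The leading + printed by set -x comes from the PS4 variable, which can be customized. For instance, a PS4 built from the standard BASH_SOURCE and LINENO variables adds the script name and line number to every traced command, roughly like this:

#!/bin/bash
PS4='+ ${BASH_SOURCE}:${LINENO}: '
set -x

v=5
echo $v

# output (the script path will vary):
# + ./script.sh:5: v=5
# + ./script.sh:6: echo 5
# 5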

Reading an undefined variable should report an error

Unlike higher-level languages such as Python or Ruby, shell scripts provide no safety mechanisms by default. For example, a Ruby script reports an error when it tries to read an uninitialized variable, while a shell script by default gives no warning at all and simply ignores it.

#!/bin/bash

echo $v
echo "hello"

# output:
#
# hello

As you can see, echo $v outputs a blank line: bash completely ignores the non-existent $v and carries on with the subsequent command echo "hello". This is rarely the behavior the developer wants; the script should report an error and stop for non-existent variables, to prevent errors from piling up. Fortunately, set -u changes this default behavior of ignoring undefined variables. Add it at the top of a script, and the script will report an error and stop execution as soon as it encounters a non-existent variable.

#!/bin/bash
set -u

echo $v
echo "hello"

# output:
# ./script.sh: line 4: v: unbound variable

Another way to write set -u is set -o nounset.
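If a script running under set -u still needs to read a variable that may not exist, bash's default-value expansion gives an explicit way to say so; a minimal sketch:

#!/bin/bash
set -u

echo "${v:-}"          # expands to an empty string instead of aborting
echo "${v:-fallback}"  # expands to "fallback" when v is unset or empty
echo "hello"

# output:
#
# fallback
# hello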

Execution should stop when a command fails

In the default shell script execution environment, when a command fails (returns a non-zero value), bash simply continues executing the subsequent commands.

#!/bin/bash

unknowncmd
echo "hello"

# output:
# ./script.sh: line 3: unknowncmd: command not found
# hello

As you can see, bash just prints an error and keeps executing the shell script, which is bad for both safety and troubleshooting. In practice, if a command fails, we often need to stop the script to prevent errors from accumulating. The following idiom is generally used for this.

command || exit 1

This means the shell script stops whenever command returns a non-zero value. If more than one action needs to happen before execution stops, there are three more advanced ways to write it.

# option 1
command || { echo "command failed"; exit 1; }

# option 2
if ! command; then echo "command failed"; exit 1; fi

# option 3
command
if [ "$?" -ne 0 ]; then echo "command failed"; exit 1; fi

A very similar idiom comes to mind here: if two commands have a dependency relationship, so that the second should run only if the first succeeds, it can be written as follows.

command1 && command2

But these techniques are somewhat cumbersome and easy to forget. set -e solves the problem at the root, by terminating the script whenever an error occurs.

#!/bin/bash
set -e

unknowncmd
echo "hello"

# output:
# ./script.sh: line 4: unknowncmd: command not found

As you can see, the script terminates after line 4 fails to execute. set -e decides whether a command failed based on its return value. However, a non-zero return value does not always indicate failure, and sometimes the developer wants the script to continue even when a command fails.

#!/bin/bash
set -e

$(ls foobar)
echo "hello"

# output:
# ls: cannot access 'foobar': No such file or directory

As you can see, with set -e turned on, even though ls is an existing command, it returns a non-zero value because its argument foobar does not actually exist, and the script stops, which is sometimes not what we want.

You can temporarily turn off set -e and turn it back on after the command has finished executing.

#!/bin/bash
set -e

set +e
$(ls foobar)
set -e

echo "hello"

# output:
# ls: cannot access 'foobar': No such file or directory
# hello

In the above code, set +e means turn off the -e option and set -e means turn it back on.

There is another way to write it that serves a similar purpose.

command || true

With this form, the script will not terminate even if command fails.

There is another way to write set -e: set -o errexit.
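One more subtlety worth knowing: set -e deliberately ignores failures in commands whose result is being tested, such as the condition of an if statement or any command in a && / || list except the last, so those constructs remain usable under set -e; a sketch:

#!/bin/bash
set -e

if ls foobar; then    # ls fails here, but set -e does not fire inside a condition
    echo "found"
fi
echo "hello"

# output:
# ls: cannot access 'foobar': No such file or directory
# hello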

Controlling the execution of pipeline commands

One exception to the set -e behavior described above is that it does not apply to pipeline commands. For a pipeline, bash takes the return value of the last subcommand as the return value of the whole command. That is, as long as the last subcommand succeeds, the pipeline as a whole succeeds, its subsequent commands still execute, and set -e never fires. As an example:

#!/bin/bash
set -e

foo | echo "bar"
echo "hello"

# output:
# ./script.sh: line 4: foo: command not found
# bar
# hello

As you can see, even though foo is a non-existent command, the pipeline foo | echo "bar" still executes successfully as a whole, so the subsequent echo "hello" continues to execute.

set -o pipefail resolves this situation: if any subcommand of a pipeline fails, the whole pipeline fails, and the script terminates.

#!/bin/bash
set -e
set -o pipefail

foo | echo "bar"
echo "hello"

# output:
# ./script.sh: line 5: foo: command not found
# bar

As you can see, the failure of the foo | echo "bar" pipeline causes the entire shell script to exit, and the subsequent echo "hello" command is never executed.
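If, instead of aborting, you want to know which stage of a pipeline failed, bash records the exit status of every stage in the PIPESTATUS array; a sketch (shown without set -e and pipefail so the script survives the failure):

#!/bin/bash

foo | echo "bar"
status=("${PIPESTATUS[@]}")   # copy immediately; the next command overwrites PIPESTATUS
echo "stage statuses: ${status[*]}"

# output:
# ./script.sh: line 3: foo: command not found
# bar
# stage statuses: 127 0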

Combining the Shell default execution environment parameters

The four set parameters described above are generally used together.

set -euxo pipefail

# or

set -eux
set -o pipefail

Either of these two forms can be placed at the top of every shell script.

Of course, these parameters can also be passed on the bash command line when executing the shell script.

bash -euxo pipefail script.sh

Defensive programming in shell scripts

Shell scripts should be written with unanticipated input in mind, such as a file that does not exist or a directory that was never created. Shell commands have many options for handling such cases. For example, mkdir returns an error by default if the parent directory does not exist, but with the -p option it first creates any missing parent directories; rm fails to delete a non-existent file, but with the -f option it succeeds even if the file does not exist.
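A short sketch of both options (the paths here are purely illustrative):

# without -p this fails when /tmp/demo does not exist yet;
# with -p the missing parent directories are created first
mkdir -p /tmp/demo/a/b

# without -f this fails when the file is already gone;
# with -f it succeeds whether or not the file exists
rm -f /tmp/demo/a/b/stale.lock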

Beware of spaces in strings

We must always watch out for spaces in strings, such as spaces in filenames or in command arguments. The best defense against such spaces is to enclose the corresponding string in quotes.

# will fail if $filename contains spaces
if [ $filename = "foo" ];


# will succeed even if $filename contains spaces
if [ "$filename" = "foo" ];

Similarly, when using $@ or any other variable holding multiple space-separated values, we should take care to enclose it in quotes. In fact, quoting a variable has no side effects; it only makes our shell scripts more robust.

# will split the string parameter if the parameter contains spaces
foo() { for i in $@; do printf "%s\n" "$i"; done }; foo bar "baz quux"
bar
baz
quux

# will not split the string parameter if the parameter contains spaces
foo() { for i in "$@"; do printf "%s\n" "$i"; done }; foo bar "baz quux"
bar
baz quux
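A related detail: with quotes, "$@" expands to one word per parameter, while "$*" joins all parameters into a single word, so the choice between the two matters as well:

# "$*" joins all parameters into one word
foo() { for i in "$*"; do printf "%s\n" "$i"; done }; foo bar "baz quux"
bar baz quux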

Use the trap command to catch signals

Another common scenario: a script fails partway through and leaves the filesystem in an inconsistent state, such as a leftover file lock, a leftover temporary file, or a file that was only partially updated. To achieve “transaction integrity” we need to resolve these inconsistencies by removing the file locks and temporary files, or by restoring the state to what it was before the update. Shell scripts do provide this ability: the trap command executes a command or function whenever a specific Unix signal is caught.

Shell scripts can catch many types of signals (the full list can be obtained with the kill -l command), but we usually only care about the three used to recover after a problem has occurred: INT, TERM and EXIT.

INT (Interrupt): sent when someone kills the script by pressing ctrl-c.
TERM (Terminate): sent when someone sends the TERM signal using the kill command.
EXIT (Exit): a pseudo-signal, triggered when the script exits, whether by reaching the end of the script, by an exit command, or by a command failing under set -e.
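The EXIT pseudo-signal alone already covers the common case of cleaning up temporary files no matter how the script stops; a minimal sketch (mktemp chooses the actual file name):

#!/bin/bash
set -e

tmpfile=$(mktemp)
trap 'rm -f "$tmpfile"' EXIT   # runs on normal exit and on set -e failures;
                               # add INT TERM, as below, to also cover signals

echo "work in progress" > "$tmpfile"
# ... use "$tmpfile" ...
# no explicit rm is needed: the trap removes the file when the script exits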

In general, we create a file lock before manipulating a shared resource.

if [ ! -e "$lockfile" ]; then
    touch "$lockfile"
    critical-section
    rm "$lockfile"
else
    echo "critical-section is already running"
fi

However, if someone manually kills the shell process while the script is operating on the shared resource, the leftover file lock will prevent the script from ever operating on that resource again. Using the trap command, we can catch the corresponding signals and recover accordingly.

if [ ! -e "$lockfile" ]; then
    trap 'rm -f "$lockfile"; exit' INT TERM EXIT
    touch "$lockfile"
    critical-section
    rm "$lockfile"
    trap - INT TERM EXIT
else
    echo "critical-section is already running"
fi

With the trap command above, the file lock is cleaned up even if someone manually kills the shell process while the script is operating on the shared resource. Note that after catching a signal we delete the file lock and then exit directly, rather than continuing execution. Another common source of inconsistency is a bulk in-place update that fails halfway through. The following script, for example, rewrites every HTML file under /var/www directly, so an error partway through leaves the site half old and half new.

for file in $(find /var/www -type f -name "*.html"); do
    perl -pi -e 's/www.example.org/www.example.com/' "$file"
done

The correct approach is to make update operations as atomic as possible to achieve “transaction consistency”:

  1. Copy the old directory.
  2. Perform the update operation in the copied directory.
  3. Replace the original directory with the copy.
cp -a /var/www /var/www-tmp
for file in $(find /var/www-tmp -type f -name "*.html"); do
    perl -pi -e 's/www.example.org/www.example.com/' "$file"
done
mv /var/www /var/www-old
mv /var/www-tmp /var/www

Performing the last two mv operations is very fast on a Unix-like filesystem (only the two directory entries are renamed; no actual copying takes place). In other words, the error-prone part is the bulk update, and since we perform all of the updates in the copied directory, even an update that goes wrong does not affect the original directory. The trade-off is that the operation uses double the disk space, and any operation that needs to hold files open for a long time must be performed in the backup directory. Keeping a series of operations atomic is very important for error-prone shell scripts, and backing up files before operating on them is a good programming habit in any case.
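The same idea works at the level of a single file: write the new content to a temporary file on the same filesystem and mv it over the original, so readers see either the old version or the new one, never a half-written file. A sketch (the file names are illustrative):

tmpfile=$(mktemp /var/www/.index.html.XXXXXX)   # same filesystem, so the mv below stays atomic
perl -p -e 's/www.example.org/www.example.com/' /var/www/index.html > "$tmpfile"
mv "$tmpfile" /var/www/index.html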