bash / postfix health check / dev tcp
Using /dev/tcp in bash for a health check? Here's an example.
I had a script that used netcat to connect to a Postfix email daemon to check its health status. To avoid pipelining errors I had it sleep between each write. The core looked somewhat like this:
messages=$(for x in \
'EHLO localhost' \
'MAIL FROM:<healthz@localhost>' \
'RCPT TO:<postmaster@example.com>' \
RSET \
QUIT
do sleep 0.1; echo "$x"; done |
nc -v $HOST 25 2>&1 |
tr -d '\r')
This works, but the sleeps make it slower than necessary, and more brittle. If the daemon is temporarily slow, we can trigger a Postfix SMTP command pipelining error anyway.
Ideally, we want to read the responses, and act on them immediately instead.
Here's a script that uses bash instead of POSIX sh, because bash has /dev/tcp support, which makes network I/O easier.
Starting bash might be slightly costlier than starting a smaller POSIX sh like dash. But we avoid calling netcat and some other tools, so we win out not only in speed but also in resource usage.
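Before the full script, here is a minimal sketch of the /dev/tcp mechanism on its own. It assumes a bash built with network redirection support; tcp_first_line is a hypothetical helper name used only for illustration. It opens a read/write file descriptor on the socket and prints the first line the peer sends, which for an SMTP server would be the greeting.

```shell
#!/bin/bash
# Sketch: open a TCP connection via bash's /dev/tcp and print the first line.
# tcp_first_line is a hypothetical helper, not part of the health check script.
# The function body is a subshell, so fd 3 is closed again when it returns.
tcp_first_line() (
    host=$1 port=$2
    # Open fd 3 read/write on the socket; bash interprets the path itself.
    exec 3<>"/dev/tcp/$host/$port" || exit 1
    # Wait up to 3 seconds for one line from the peer.
    IFS= read -t 3 -r line <&3 || exit 2
    # Strip the trailing CR that network protocols such as SMTP send.
    printf '%s\n' "${line%$'\r'}"
)

# Example (needs something listening on that port):
#   tcp_first_line 127.0.0.1 25
```

The function returns non-zero when the connection fails or nothing arrives in time. The complete health check follows.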
#!/bin/bash
# Using bash (not POSIX sh) for /dev/tcp I/O.
: ${BASH_VERSION:?Use bash, not POSIX-sh}
set -u
DEBUG=0
READ_TIMEOUT=10
CHAR_CR=$'\r'
#HOST=mail.example.com
#HOST=1.2.3.4
HOST=127.0.0.1
PROXY_PROTOCOL=
if [ "$HOST" = 127.0.0.1 ]; then
    # This should succeed in mere milliseconds. But sometimes postfix
    # decides to take just over a second for the RCPT TO check.
    READ_TIMEOUT=3
    PROXY_PROTOCOL='PROXY TCP4 127.0.0.1 127.0.0.1 12345 25' # (or empty)
fi
getresp() {
    local line status
    while :; do
        # A timeout or EOF leaves $line empty; that is caught below.
        IFS= read -t "$READ_TIMEOUT" -r line
        [ "$DEBUG" -ne 0 ] && printf '%s\n' "<<< $line" >&2
        printf '%s\n' "${line%$CHAR_CR}"
        test -z "$line" && exit 65
        status=${line:0:3}
        case $status in
            [0-9][0-9][0-9]) ;;
            *) exit 67 ;; # not a three-digit status code
        esac
        if [ "$status" -lt 200 ] || [ "$status" -ge 300 ]; then
            exit 66
        elif [ "${line:3:1}" = ' ' ]; then
            break # "250 blah"
        elif [ "${line:3:1}" = '-' ]; then
            true # "250-blah" (continue)
        else
            exit 67
        fi
    done
}
if ! exec 3<>"/dev/tcp/$HOST/25"; then # open fd
    # Connecting to an unreachable host takes a looooot of time. 2m10 in
    # the test case. You will want to wrap this script in a timeout(1) call.
    # $0: connect: Connection timed out
    # $0: line 40: /dev/tcp/1.2.3.4/25: Connection timed out
    exit 1
fi
messages=$(for x in \
    "$PROXY_PROTOCOL" \
    'EHLO localhost' \
    'MAIL FROM:<healthz@localhost>' \
    'RCPT TO:<postmaster@example.com>' \
    RSET \
    QUIT
do
    # An empty first item sends nothing; getresp then reads the greeting.
    [ -n "$x" ] && [ "$DEBUG" -ne 0 ] && printf '>>> %s\n' "$x" >&2
    [ -n "$x" ] && printf '%s\r\n' "$x" >&3
    getresp <&3 || exit $?
done)
ret=$?
exec 3>&- # close fd
ok=$(echo "$messages" | grep -xF '250 2.0.0 Ok')
fail=$(echo "$messages" | sed -e1d | grep -v ^2 | grep '')
if [ "$ret" -ne 0 ] || [ -z "$ok" ]; then
    # $messages is left unquoted to fold the output onto one line.
    echo "Missing OK line (ret $ret). Got:" $messages
    false
elif [ -n "$fail" ]; then
    echo "$fail"
    false
else
    true
fi
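The reply handling in getresp can be tried out without a mail server. Below is a minimal sketch, not part of the script above: classify_reply is a hypothetical helper that applies the same rules to a single line (2xx followed by a space ends a reply, 2xx followed by a dash means more lines follow, anything else is a failure), and the canned replies in the here-document are made up for illustration.

```shell
#!/bin/bash
# Sketch: classify one SMTP reply line using the same rules as getresp.
# classify_reply is a hypothetical helper name used for illustration only.
classify_reply() {
    local line=${1%$'\r'}                  # strip a trailing CR, if any
    local status=${line:0:3} sep=${line:3:1}
    case $status in
        [0-9][0-9][0-9]) ;;                # three digits: a status code
        *) echo malformed; return ;;
    esac
    if [ "$status" -lt 200 ] || [ "$status" -ge 300 ]; then
        echo error                         # anything outside 2xx fails
    elif [ "$sep" = ' ' ]; then
        echo final                         # "250 blah": last line of a reply
    elif [ "$sep" = '-' ]; then
        echo continue                      # "250-blah": more lines follow
    else
        echo malformed
    fi
}

# Feed a few canned reply lines through the classifier.
while IFS= read -r line; do
    printf '%-9s %s\n' "$(classify_reply "$line")" "$line"
done <<'EOF'
250-mail.example.com
250-PIPELINING
250 SMTPUTF8
554 5.7.1 Relay access denied
EOF
```

Keeping the classification separate from the socket read is only a convenience for experimenting; the script above does both in one loop.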
The health check itself can be invoked from something like an HAProxy agent check script. This is also where you add the timeout call.
#!/bin/sh
CHECK_SCRIPT=/usr/local/bin/postfix-is-healthy
if test -f /srv/in_maintenance; then
    echo 'drain stopped 0%'
elif error=$(timeout -k3 -s1 2 $CHECK_SCRIPT 2>&1); then
    echo 'ready up 100%'
else
    # $error is left unquoted to fold the output onto one line.
    echo 'fail #' $error
fi
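To see how the timeout wrapper behaves when the check hangs, here is a small sketch. It assumes GNU coreutils timeout, which exits with 124 when the time limit is hit and 137 when it had to fall back to SIGKILL. run_check is a hypothetical wrapper, and the separate "timed out" message is my own addition, not something the agent check script above prints.

```shell
#!/bin/bash
# Sketch: how timeout(1) reports an overrun. Assumes GNU coreutils timeout.
# run_check is a hypothetical wrapper; "sleep 5" stands in for a hung check.
run_check() {
    local limit=$1; shift
    local out rc=0
    # -s1 sends SIGHUP at the limit, -k3 follows up with SIGKILL 3s later.
    out=$(timeout -k3 -s1 "$limit" "$@" 2>&1) || rc=$?
    case $rc in
        0)       echo 'ready up 100%' ;;
        124|137) echo 'fail # timed out' ;;   # limit hit (or SIGKILL needed)
        *)       echo "fail # $out" ;;        # the check itself failed
    esac
}

run_check 1 sleep 5   # a check that hangs past the 1 second limit
run_check 1 true      # a check that succeeds right away
```

Without the timeout, a connect to an unreachable host would block the agent for as long as the TCP connect takes to give up.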
And the agent check script could in turn be invoked from something like xinetd:
service postfix-haproxy
{
    flags = REUSE
    socket_type = stream
    type = unlisted
    port = 1025
    wait = no
    user = nobody
    server = /usr/local/bin/postfix-haproxy-agent-check
    log_on_failure += USERID
    disable = no
    only_from = 0.0.0.0/0 ::1
    per_source = UNLIMITED
}