I am very happy with my current mail setup, but a nasty bug popped up out of nowhere and I can't trace it. Some qmail-remote processes, under circumstances yet to be determined, hang forever and eat all the available CPU. qmail-remote is the piece of qmail that takes care of delivering messages to recipients at a remote host.
When this happens, the stuck process ignores the timeoutremote control and stays active forever. The truss command (the FreeBSD equivalent of strace) doesn't show any activity, and there doesn't appear to be any related network activity either... it looks like some kind of race condition.
For now I haven't been able to address the issue properly, due both to a lack of time and to a lack of deep understanding of C and debugging with GDB, so I just mitigate the problem with a cron job that runs a bash script to detect and kill the offending processes.
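#!/usr/local/bin/bash
# Kill qmail-remote processes that have run too long while burning CPU.
# limit in seconds
LIMIT_TIME=120
# limit in cpu (percent)
LIMIT_CPU=35
# %cpu is listed before command so the command's arguments cannot shift the columns;
# the [q] keeps grep from matching its own process.
ps -xao pid,etime,%cpu,command | grep '[q]mail-remote' | while read -r pid etime cpu _; do
    # round %cpu up to a whole number, since [ ] only compares integers
    cpu=$(echo "($cpu)/1+1" | bc)
    # etime is [[dd-]hh:]mm:ss; split off the optional day count first
    days=0
    case "$etime" in
        *-*) days=${etime%%-*}; etime=${etime#*-} ;;
    esac
    IFS=: read -r -a time_parts <<< "$etime"
    if [ ${#time_parts[@]} -lt 3 ]; then
        elapsed=$(echo "${time_parts[0]}*60 + ${time_parts[1]}" | bc)
    else
        elapsed=$(echo "$days*86400 + ${time_parts[0]}*3600 + ${time_parts[1]}*60 + ${time_parts[2]}" | bc)
    fi
    # kill only processes that are both old and busy
    if [ "$elapsed" -gt "$LIMIT_TIME" ] && [ "$cpu" -gt "$LIMIT_CPU" ]; then
        kill -s 9 "$pid"
    fi
done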
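The elapsed-time arithmetic is the fiddliest part of the script, because ps prints etime as [[dd-]hh:]mm:ss and the number of fields varies. As a rough sketch, the conversion can be pulled out into a small function and checked on its own. The name etime_to_seconds is made up for this example; it is not part of the script above.

```shell
# Convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
# etime_to_seconds is a hypothetical helper name used only for this sketch.
etime_to_seconds() {
    etime=$1
    days=0
    # Split off the optional "dd-" day prefix.
    case $etime in
        *-*) days=${etime%%-*}; etime=${etime#*-} ;;
    esac
    # awk splits on ":" and reads "08" as decimal 8, so leading zeros are safe.
    echo "$days:$etime" | awk -F: '{
        if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4   # days:hh:mm:ss
        else         print $1*86400 + $2*60 + $3             # days:mm:ss
    }'
}

etime_to_seconds 02:05        # prints 125
etime_to_seconds 01:02:03     # prints 3723
etime_to_seconds 1-00:00:08   # prints 86408
```

Handling the day prefix matters mostly if the cron job has not run for a while, since a process hung for more than a day would otherwise be parsed wrongly.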
But I'm not really happy with this “solution”, and I will keep pursuing a real understanding of, and a proper fix for, this problem.
Some interesting links about other people with the same problem:
http://permalink.gmane.org/gmane.os.freebsd.stable/82760
http://copilotco.com/mail-archives/qmail.2002/msg08733.html
UPDATE and SOLUTION
All credit where credit is due.
To replicate this, catch a hanging qmail-remote with top. Then look up the offending qmail-remote pid with ps to get the full argument list:
ps -wwaux | grep pid_number
You should get something like ‘qmail-remote mailserver from@email to@email’. With this information, and with top and truss you can invoke qmail-remote from the command line and get a nice qmail-remote hang…
truss /var/qmail/bin/qmail-remote mailserver my@email nonexisten@email < ./test
test is just a file with some bogus input to serve as the outgoing email, since qmail-remote expects a message on stdin. For the record, most of the hangs happen when talking to the Symantec Email Gateway software.
Now, with an updated ports tree, just recompile qmail; you can follow my guide (it is all still valid). Just run make, though; there is no need for make install. Then move the new qmail-remote to /var/qmail/bin/ and set the right permissions (711) and ownership (root:qmail).
And voilà, if you repeat the test procedure, you will find that qmail-remote is not hanging anymore 🙂
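For anyone who wants to repeat the test, here is one way the bogus ./test file could be created. The headers and body are arbitrary placeholders I am assuming for illustration; any small message should do, and the addresses are the same dummy ones used in the truss command above.

```shell
# Build a minimal bogus message for qmail-remote to read on stdin.
# The header values and body text are placeholders, not taken from a real message.
printf 'From: my@email\nTo: nonexisten@email\nSubject: test\n\nbogus test message\n' > ./test

# Then, on the mail server itself (not run here):
# truss /var/qmail/bin/qmail-remote mailserver my@email nonexisten@email < ./test
```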
same here..
IT’S FIXED AT LAST!!
http://svnweb.freebsd.org/ports/head/mail/qmail-tls/Makefile?view=log
Modified Tue Oct 16 13:35:28 2012 UTC (5 weeks, 1 day ago) by garga
– Update TLS patch to v2, what address an issue that qmail-remote loops on
malformed server response
I’ve just upgraded and tested with that fuckin’ symantec mx.
My guess is that the malformed header here is the SIZE:
250-ENHANCEDSTATUSCODES
250-SIZE 10485760
250-8BITMIME
250 PIPELINING
It has an extra space.
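If the extra-space guess is right, it should be easy to spot in a saved copy of the server's EHLO reply. This is only a sketch: the function name find_extra_spaces and the file name ehlo.txt are made up for the example, and it assumes the doubled space comes right after the keyword.

```shell
# Print (with line numbers) any "250" reply line whose keyword is followed
# by two or more spaces. find_extra_spaces is a hypothetical helper name.
find_extra_spaces() {
    grep -n -E '^250[- ][^ ]+ {2,}' "$1"
}

# Example with a hand-made reply file (ehlo.txt is an assumed name):
printf '250-ENHANCEDSTATUSCODES\n250-SIZE  10485760\n250-8BITMIME\n250 PIPELINING\n' > ehlo.txt
find_extra_spaces ehlo.txt    # prints 2:250-SIZE  10485760
```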
MEGA KUDOS
will recompile with latest TLS patch and hope that will be fixed. HATE this bug!