[torquedev] torque server dying after bad request

Michael Meier Michael.Meier at rrze.uni-erlangen.de
Tue Jun 23 03:10:46 MDT 2009


We've hit another bug in the torque server. Over the weekend, the pbs server 
of our main cluster kept dying every few hours. The last message in the logs 
was always something like:
 06/22/2009 14:32:06;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 1 (Input value too large to convert to this type), type=JobObituary
 06/22/2009 14:32:06;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=JobObituary, from @
Of course, the fact that torque did not say which node sent the bad 
message (note the empty "from @") was EXTREMELY annoying. gdb revealed that 
the server wasn't segfaulting; it was calling abort():
 Program received signal SIGABRT, Aborted.
 0x00002aea3e0beb75 in raise () from /lib64/libc.so.6
 (gdb) bt
 #0  0x00002aea3e0beb75 in raise () from /lib64/libc.so.6
 #1  0x00002aea3e0bff30 in abort () from /lib64/libc.so.6
 #2  0x0000000000436606 in get_next (pl={ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}, file=<value optimized out>, line=<value optimized out>) at ../Libifl/list_link.c:372
 #3  0x000000000042d96e in free_attrlist (pattrlisthead=<value optimized out>) at attr_func.c:405
 #4  0x000000000041207c in free_br (preq=0x5c8e20) at process_request.c:1180
 #5  0x00000000004135b3 in reply_send (request=0x5c8e20) at reply_send.c:294
 #6  0x000000000041399d in req_reject (code=15056, aux=0, preq=0x5c8e20, HostName=0x0, Msg=0x43e7c8 "cannot decode message") at reply_send.c:446
 #7  0x0000000000412962 in process_request (sfds=13) at process_request.c:537
 #8  0x0000000000438f02 in wait_request (waittime=<value optimized out>, SState=0x5652b8) at ../Libnet/net_server.c:469
 #9  0x00000000004116df in main (argc=<value optimized out>, argv=<value optimized out>) at pbsd_main.c:1378
As you can see, after rejecting the bad request, the server tries to free the 
resources associated with it, including the attrlist. However, because the 
request could not be decoded, the attrlist is completely empty: its list head 
is all zeros (see frame #2). get_next() in the linked list code does not 
accept such a link and calls abort(), sort of like an assertion, and the 
server just exits.
The fix really is a no-brainer: check the attrlist, and don't try to empty it 
if it is already empty. You can find a patch against 2.3.6 attached. That 
patch prevents the torque server from dying just because it received a bad 
request.
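
For readers without the attachment, here is a minimal sketch of that kind of guard. It is not the attached patch and not torque's real code: the struct only mirrors the field names visible in frame #2, and init_head(), head_is_usable(), append_entry() and free_list_guarded() are hypothetical names made up for illustration. It also assumes that a properly initialized empty list head points at itself, while a head that was never set up has NULL links.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a list link; field names taken from the backtrace. */
typedef struct list_link {
    struct list_link *ll_prior;
    struct list_link *ll_next;
    void             *ll_struct;   /* payload owned by the list */
} list_link;

/* Assumption of this model: an initialized, empty head points at itself. */
static void init_head(list_link *head)
{
    head->ll_prior  = head;
    head->ll_next   = head;
    head->ll_struct = NULL;
}

/* A head that was never initialized (all zeros) has NULL links. */
static int head_is_usable(const list_link *head)
{
    return head != NULL && head->ll_next != NULL && head->ll_prior != NULL;
}

/* Append a heap-allocated payload at the tail; returns 0 on success. */
static int append_entry(list_link *head, void *payload)
{
    list_link *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->ll_struct = payload;
    n->ll_next   = head;
    n->ll_prior  = head->ll_prior;
    head->ll_prior->ll_next = n;
    head->ll_prior = n;
    return 0;
}

/*
 * Free all entries and return how many were freed. An all-zero head is
 * treated as "nothing to free" instead of being walked, which is the
 * point of the guard.
 */
static int free_list_guarded(list_link *head)
{
    int freed = 0;

    if (!head_is_usable(head))
        return 0;

    while (head->ll_next != head) {
        list_link *n = head->ll_next;
        head->ll_next = n->ll_next;
        n->ll_next->ll_prior = head;
        free(n->ll_struct);
        free(n);
        freed++;
    }
    return freed;
}
```

Called on a zeroed head, free_list_guarded() returns 0 without touching any pointers; on an initialized head it frees whatever was appended and leaves the head empty again.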
Also attached is a second patch. I used it to find the bad node and it might 
be helpful for other people trying to trace similar problems - but it 
certainly isn't good for inclusion into mainline.
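
The general idea of identifying the sender of a request can be sketched with plain POSIX calls. This is not the attached patch and uses no torque internals; format_peer() is a hypothetical helper that asks the kernel for the peer address of a connected IPv4 socket (such as the sfds seen in frame #7) so it can be written to a log.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: write "a.b.c.d:port" for the IPv4 peer of the
 * connected socket fd into buf. Returns 0 on success, -1 if the peer
 * cannot be determined or is not IPv4.
 */
static int format_peer(int fd, char *buf, size_t buflen)
{
    struct sockaddr_storage ss;
    socklen_t len = sizeof ss;
    char ip[INET_ADDRSTRLEN];
    const struct sockaddr_in *sin;

    if (getpeername(fd, (struct sockaddr *)&ss, &len) != 0)
        return -1;
    if (ss.ss_family != AF_INET)
        return -1;

    sin = (const struct sockaddr_in *)&ss;
    if (inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof ip) == NULL)
        return -1;

    snprintf(buf, buflen, "%s:%u", ip, (unsigned)ntohs(sin->sin_port));
    return 0;
}
```

Logging the result next to the "req header bad" message would point at the offending host even when the request itself cannot be decoded.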
BTW: In the end it turned out that the reason for all of this was that one of 
the moms had gone bad and corrupted some internal structures. It was logging 
nonsense characters in every log line (random characters in the place where 
it usually writes "  pbs_mom"), and although it was still starting jobs, it 
occasionally sent a bad message to the server, causing the server to die.
-- 
Michael Meier, HPC Services
Friedrich-Alexander-Universitaet Erlangen-Nuernberg
Regionales Rechenzentrum Erlangen
Martensstrasse 1, 91058 Erlangen, Germany
Tel.: +49 9131 85-28973, Fax: +49 9131 302941
michael.meier at rrze.uni-erlangen.de
www.rrze.uni-erlangen.de/hpc/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-2.3.6-donotdieafterbadrequest.patch
Type: text/x-diff
Size: 550 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20090623/f1c3dfdf/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: torque-2.3.6-showpeerad.patch
Type: text/x-diff
Size: 1547 bytes
Desc: not available
Url : http://www.supercluster.org/pipermail/torquedev/attachments/20090623/f1c3dfdf/attachment-0001.bin 