[torquedev] [Fwd: [Mauiusers] FW: maui pausing on Torque multiple qsubs]

Craig Macdonald craigm at dcs.gla.ac.uk
Tue Feb 20 16:47:43 MST 2007


(Apologies for the cross-post, this is a cross-discipline problem)
I've managed to trace this problem a bit further:

Essentially, Maui has my PBS API timeout set to 9 seconds, which the 
pbs_statnode() call honours.
However, once the timeout has been detected (around line 1270 of MPBSI.c),
Maui tries to disconnect from the pbs_server using pbs_disconnect().
pbs_disconnect() sets an alarm for 9 seconds, then tries to read the 
socket.
In that file read() is defined as read_nonblocking_socket() from 
nonblock.c, and this is the call that blocks.
The gdb trace is at the end of this message.


So there are two or three issues here:
1. Why does pbs_disconnect() block for 15 minutes when it has an alarm() 
around it?
Maui sets the alarm() value to 9 seconds via the PBSAPITIMEOUT env var, 
yet the actual timeout in a recent case was 916 seconds (about 15 
minutes).
NB: I haven't recompiled torque to see what value of PBSAPITIMEOUT it 
sees, but I have checked that Maui sets PBSAPITIMEOUT correctly.

2. Why isn't read_nonblocking_socket() doing what it says on the tin?

3. What is MUThread() for in Maui, if it doesn't provide timeouts? I know 
from extra debug statements that it is the pbs_disconnect() call that 
times out after 15 minutes.
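
On issue 2, one way to get a hard upper bound on a read without relying 
on signals is to wait for readability with poll() first. The sketch 
below is an assumption about what a fix could look like, not what 
nonblock.c does; read_with_timeout() is a hypothetical name and the 
timeout is given in milliseconds.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper, not Torque's read_nonblocking_socket().
 * Waits up to timeout_ms for fd to become readable, then reads once.
 * Returns the number of bytes read, or -1 with errno == ETIMEDOUT if
 * nothing arrived in time (other errno values come from poll/read). */
ssize_t read_with_timeout(int fd, void *buf, size_t count, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(timeout_ms >= 0);    /* a negative value means "wait forever" to poll() */

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc == -1)
        return -1;              /* errno set by poll(), possibly EINTR */
    if (rc == 0) {
        errno = ETIMEDOUT;      /* nothing readable within the limit */
        return -1;
    }

    /* fd is readable, at EOF, or in error: read() returns promptly. */
    return read(fd, buf, count);
}
```

The timeout here is enforced by the kernel inside poll(), so it does not 
depend on which thread receives SIGALRM or on how the signal handler was 
installed.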

Many thanks for any pearls of wisdom anyone may have.

Craig


Program received signal SIGINT, Interrupt.
0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) bt
#0  0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00aaead3 in __read_nocancel () from /lib/tls/libc.so.6
#2  0x0011e5b3 in read_nonblocking_socket (fd=10, buf=0x131720, count=16384) at ../Libifl/nonblock.c:116
#3  0x0011f391 in pbs_disconnect (connect=1) at ../Libifl/pbsD_connect.c:597
#4  0x080d7b6a in MPBSClusterQuery (R=0x8d91c40, RCount=0xfef923a8, EMsg=0x0, SC=0x0) at MPBSI.c:1288
#5  0x080a015a in __MUTFunc (V=0xfef92320) at MUtil.c:4717
#6  0x080a01d8 in MUThread (F=0x80d7a84 <MPBSClusterQuery>, TimeOut=9, RC=0xfef923a4, ACount=4, Lock=0x0) at MUtil.c:4690
#7  0x080d0c51 in MRMClusterQuery (RCount=0xfef923dc, SC=0x0) at MRM.c:493
#8  0x080d0ddf in MRMGetInfo () at MRM.c:352
#9  0x0807341d in MSchedProcessJobs (OldDay=0xfefa3850 "Tue", GlobalSQ=0xfef9f850, GlobalHQ=0xfef9b850) at MSched.c:6870
#10 0x0804caff in main (ArgC=2, ArgV=0xfefa3934) at Server.c:192

-------------- next part --------------
An embedded message was scrubbed...
From: "Craig Macdonald" <craigm at dcs.gla.ac.uk>
Subject: [Mauiusers] FW: maui pausing on Torque multiple qsubs
Date: Mon, 5 Feb 2007 23:46:04 -0000
Size: 17375
Url: http://www.supercluster.org/pipermail/torquedev/attachments/20070220/299b9872/mauipausingonTorquemultipleqsubs.mht

