[torquedev] [Fwd: [Mauiusers] FW: maui pausing on Torque multiple qsubs]
craigm at dcs.gla.ac.uk
Tue Feb 20 16:47:43 MST 2007
(Apologies for the cross-post, this is a cross-discipline problem)
I've managed to trace this problem a bit further:
Essentially, Maui has my PBS API timeout set to 9 seconds, which the
pbs_statnode() call honours.
However, once the timeout has been detected (around line 1270 of MPBSI.c),
Maui tries to disconnect from the pbs_server, using pbs_disconnect().
pbs_disconnect() sets an alarm for 9 seconds, then tries to read from the
socket. read() is defined as read_nonblocking_socket() in nonblock.c. However,
this is what blocks.
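For comparison, a read that cannot outlive its timeout might look like the sketch below. This is not Torque's read_nonblocking_socket(); read_with_deadline() and now_ms() are hypothetical names, and the sketch assumes POSIX poll() and a monotonic clock are available. The point it illustrates is that every retry (EINTR, EAGAIN) re-checks one overall deadline, so retries cannot add up to more than the requested timeout.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper, not part of Torque: milliseconds on a monotonic clock. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper, not part of Torque: read up to count bytes from fd,
 * giving up once timeout_ms milliseconds have elapsed in total.
 * Returns the number of bytes read (0 on EOF), or -1 with errno set
 * (ETIMEDOUT if the deadline passed before any data arrived).
 */
ssize_t read_with_deadline(int fd, void *buf, size_t count, int timeout_ms)
{
    assert(buf != NULL && timeout_ms >= 0);
    long long deadline = now_ms() + timeout_ms;

    for (;;) {
        long long remaining = deadline - now_ms();
        if (remaining <= 0) {
            errno = ETIMEDOUT;
            return -1;
        }

        /* Wait for readability, but never longer than the time left. */
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, (int)remaining);
        if (rc == 0) {
            errno = ETIMEDOUT;
            return -1;
        }
        if (rc < 0) {
            if (errno == EINTR)
                continue;           /* retry; the deadline still applies */
            return -1;
        }

        ssize_t n = read(fd, buf, count);
        if (n >= 0)
            return n;
        if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        /* Transient failure: loop and re-check the deadline. */
    }
}
```

A caller that wanted the 9 second limit would pass 9000 as timeout_ms and treat ETIMEDOUT as the signal to drop the connection.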
The gdb trace is below.
So there are two or three issues here:
1. pbs_disconnect() shouldn't block for 15 minutes, as it has an alarm() set.
The alarm() value is set by Maui to be 9 seconds, by setting the
PBSAPITIMEOUT env var. The actual timeout in a recent case was 916
seconds (about 15 mins).
NB: I haven't recompiled torque to see what value of PBSAPITIMEOUT it
sees, but I have checked that Maui sets PBSAPITIMEOUT correctly.
2. Why isn't read_nonblocking_socket() doing what it says on the tin?
3. What is MUThread() for in Maui, if it doesn't provide timeouts? I know
from extra debug statements it is the pbs_disconnect call that times out.
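On issue 1, one general way an alarm() can fail to bound a read() is worth spelling out: if the SIGALRM handler is installed with SA_RESTART, or if the read loop retries on EINTR unconditionally, the alarm fires but the read simply resumes. Whether that is what happens inside Torque is not established here. The sketch below shows the case where the alarm does take effect; read_with_alarm() and on_alarm() are hypothetical names, not Torque or Maui functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler so the read loop can tell our alarm from other signals. */
static volatile sig_atomic_t alarm_fired = 0;

static void on_alarm(int sig)
{
    (void)sig;
    alarm_fired = 1;
}

/*
 * Hypothetical helper, not part of Torque: blocking read() bounded by alarm().
 * Returns the number of bytes read, or -1 with errno == EINTR if the alarm
 * expired first.
 */
ssize_t read_with_alarm(int fd, void *buf, size_t count, unsigned seconds)
{
    struct sigaction sa, old_sa;
    ssize_t n;
    int saved_errno;

    assert(buf != NULL && seconds > 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;    /* no SA_RESTART, so read() fails with EINTR */
    if (sigaction(SIGALRM, &sa, &old_sa) != 0)
        return -1;

    alarm_fired = 0;
    alarm(seconds);

    do {
        n = read(fd, buf, count);
        /* Retry only when interrupted by a signal other than our alarm. */
    } while (n < 0 && errno == EINTR && !alarm_fired);

    saved_errno = errno;
    alarm(0);                           /* cancel a still-pending alarm */
    sigaction(SIGALRM, &old_sa, NULL);  /* restore the previous handler */
    errno = saved_errno;
    return n;
}
```

Even this version has a small race: if the alarm fires after alarm() is called but before read() is entered, nothing interrupts the read. That is one reason a poll() or select() based timeout is usually preferred over alarm().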
Many thanks for any pearls of wisdom anyone may have.
Program received signal SIGINT, Interrupt.
0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#0 0x0029c7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00aaead3 in __read_nocancel () from /lib/tls/libc.so.6
#2 0x0011e5b3 in read_nonblocking_socket (fd=10, buf=0x131720,
count=16384) at ../Libifl/nonblock.c:116
#3 0x0011f391 in pbs_disconnect (connect=1) at ../Libifl/pbsD_connect.c:597
#4 0x080d7b6a in MPBSClusterQuery (R=0x8d91c40, RCount=0xfef923a8,
EMsg=0x0, SC=0x0) at MPBSI.c:1288
#5 0x080a015a in __MUTFunc (V=0xfef92320) at MUtil.c:4717
#6 0x080a01d8 in MUThread (F=0x80d7a84 <MPBSClusterQuery>, TimeOut=9,
RC=0xfef923a4, ACount=4, Lock=0x0) at MUtil.c:4690
#7 0x080d0c51 in MRMClusterQuery (RCount=0xfef923dc, SC=0x0) at MRM.c:493
#8 0x080d0ddf in MRMGetInfo () at MRM.c:352
#9 0x0807341d in MSchedProcessJobs (OldDay=0xfefa3850 "Tue",
GlobalSQ=0xfef9f850, GlobalHQ=0xfef9b850) at MSched.c:6870
#10 0x0804caff in main (ArgC=2, ArgV=0xfefa3934) at Server.c:192
-------------- next part --------------
An embedded message was scrubbed...
From: "Craig Macdonald" <craigm at dcs.gla.ac.uk>
Subject: [Mauiusers] FW: maui pausing on Torque multiple qsubs
Date: Mon, 5 Feb 2007 23:46:04 -0000