[torqueusers] Torque not responding to maui...
agrvaibhav at gmail.com
Wed Jan 9 23:30:57 MST 2008
I am running the torque resource manager (version 2.1.6) on 100 FreeBSD
boxes, with maui (3.2.6p16) as the scheduler.
The cluster as a whole behaves very well when fewer than 8000 jobs have been
submitted, and it executes around 300 jobs as a whole, which matches the
configuration. But when the number of jobs exceeds 8000, it behaves very
strangely. The number of jobs getting scheduled decreases to 100, sometimes
50, and sometimes no jobs are scheduled at all. The cluster goes into a hung
state and the jobs remain queued for a long time (usually 45 minutes to an
hour).
While investigating the issue and reading the source code of maui and
torque, I found that maui queries torque for node status. torque does not
reply to this query, so maui proceeds to disconnect from torque. torque does
not answer this disconnect request promptly either, and maui waits for the
reply. This wait sometimes lasts 45 minutes to an hour. Looking through the
torque source code, I was not able to track down why torque does not respond
to maui in a timely fashion.
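
To check from outside maui whether the torque server itself is slow to
answer, a rough probe like the one below could time a node status query at
different queue depths. This is only a sketch under assumptions: it assumes
the pbsnodes client is on the PATH of the host where it runs, and the
time_command helper and the 60-second timeout are illustrative choices, not
anything taken from maui or torque.

```python
import subprocess
import time


def time_command(cmd, timeout_s=30.0):
    """Run cmd and return (status, elapsed_seconds).

    status is one of:
      "ok"      - the command exited with code 0
      "failed"  - the command exited with a non-zero code
      "timeout" - the command did not finish within timeout_s
      "missing" - the executable could not be found
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Assumption: "pbsnodes -a" asks the server for the status of all nodes.
    # The 60 s limit is arbitrary; adjust it to the scheduler's own timeout.
    status, elapsed = time_command(["pbsnodes", "-a"], timeout_s=60.0)
    print("node status query: %s after %.1f s" % (status, elapsed))
```

Running this periodically while the queue grows past 8000 jobs would show
whether the delay is visible to any client or only to maui.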
Does anyone have any idea what the problem may be? Also, has anyone used
torque with more than 10000 jobs queued?