[torqueusers] torque/maui confusion when systems not available

Miles O'Neal meo at intrinsity.com
Mon Dec 10 15:03:30 MST 2007

In the past month, we've had several instances of
maui and/or pbs_server going out to lunch.  In each
case, one or both seemed to get confused when a system
wasn't available for an extended period of time.

In two cases, we had queued jobs targeted at specific
systems that stayed busy on previously running jobs
for over 24 hours.

In two more cases, we had systems lock up over a weekend
while running jobs.  In the latter two cases, after 12
to 24 hours, we could not get offlined systems to accept
jobs after running "pbsnodes -c" against them.  In these
cases, the number of running jobs dropped (despite jobs
going into the queues).  In one of those cases, about 10%
of torque and/or maui connections were failing, and most
were taking 5-10 seconds to respond, versus the usual nearly
instantaneous response.

In every case, we had to dequeue or requeue the job in question,
*and* make sure the system in question was back online or
remove it from torque altogether, *and* restart maui (in two
cases, both torque and maui).
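
For reference, the manual recovery above can be sketched roughly as the
short Python helper below. It only lists the commands in order and does
not run anything. The job id and node name are placeholders, using qrerun
for "requeue" and qdel for "dequeue" is an assumption about which commands
apply, and the maui restart command is a guess that will differ by install.

```python
from typing import List, Sequence

# Assumption: how maui gets restarted is site-specific; this init-script
# path is only a placeholder and is not taken from the report above.
MAUI_RESTART = ("/etc/init.d/maui", "restart")


def recovery_commands(
    job_id: str,
    node: str,
    requeue: bool = True,
    maui_restart: Sequence[str] = MAUI_RESTART,
) -> List[List[str]]:
    """Return the recovery commands, in order, without executing them."""
    # Step 1: requeue (qrerun) or dequeue (qdel) the affected job.
    job_cmd = ["qrerun", job_id] if requeue else ["qdel", job_id]
    return [
        job_cmd,
        # Step 2: clear the offline flag on the node.
        ["pbsnodes", "-c", node],
        # Step 3: restart the scheduler.
        list(maui_restart),
    ]


if __name__ == "__main__":
    # Hypothetical job id and node name, printed for review only.
    for cmd in recovery_commands("1234.server", "node01"):
        print(" ".join(cmd))
```

Printing the list first makes it easy to check each step by hand before
running anything against a live pbs_server.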

Any ideas on the problem?


