[Mauiusers] jobs targeted at specific systems - a problem?

Miles O'Neal meo at intrinsity.com
Mon Dec 3 14:16:35 MST 2007

[We run torque and maui.]

From time to time we run system administration
jobs on all the machines or a targeted subgroup.
To avoid interference with user community jobs,
we do something like this:

   foreach i ( `cat node_list` )
      qsub -q our_queue -l nodes=$i script_path
   end

Twice in the past month when we've done this,
a few of these jobs have sat in the queue for
over 24 hours.  Between hour 24 and hour 30 or so
we start seeing some slowdown and the number of
machines running jobs drops to less than half the
available machines.

In both cases, by the time our group was notified,
other people were trying lots of random things,
so even though everything returned to normal shortly
after the delayed jobs got qdel'd (because someone
had a hunch about them), we have no way of knowing
for sure whether these jobs were the cause.
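
If it happens again, capturing the scheduler's view
of a stuck job before anyone deletes it might settle
the question.  Below is a rough Bourne-shell sketch,
not something we have run in production.  It only
prints the commands.  checkjob is Maui's and qstat
is Torque's; the function name diag_cmds and the job
ids are made up for illustration.

```shell
#!/bin/sh
# Print (do not run) the diagnostic commands for each job id given.
# checkjob -v : Maui's verbose report on a single job
# qstat -f    : Torque's full status listing for a single job
diag_cmds() {
    for job in "$@"; do
        echo "checkjob -v $job"
        echo "qstat -f $job"
    done
}
```

For example, `diag_cmds 1234 1235 | sh > stuck_jobs.log`
would run the commands and save their output before any
qdel.  The ids and the log file name are placeholders.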

Has anyone else seen similar problems?  Were they
related to jobs targeted to specific machines but
sitting in the queue a long time?


Miles O'Neal
IT Manager
Intrinsity, Inc.
meo at intrinsity.com

