Re: [torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely

Leigh Gordon Leigh.Gordon at utas.edu.au
Sun Aug 19 16:54:55 MDT 2007


Hi, thanks for the reply.

I've looked into it a bit more (even going as far as looking at the source code 
for 'top', which unfortunately also reads "/proc" rather than using system 
calls), and it seems like a job for someone with a lot more experience with C 
than I have. I did find a GPL project called Supermon; the paper I read on it 
specifically said it uses system calls rather than /proc on Linux, so it may 
hold some clues on how to do it.

Decreasing the MOM polling frequency (from every 3 minutes to every 5) at 
least seems to be preventing jobs from becoming blocked for good (jobs that 
get marked "Deferred" now return to "Idle" rather than "BatchHold"), although 
it doesn't stop the system having its little fits every few minutes.
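For anyone else wanting to tune this: I believe the relevant knobs in stock Maui/TORQUE are the ones below (names taken from the standard docs, so double-check them against your own install before copying anything):

```shell
# maui.cfg -- how often Maui polls the resource manager (HH:MM:SS):
#   RMPOLLINTERVAL  00:05:00

# mom_priv/config -- pbs_mom's own internal poll interval, in seconds:
#   $check_poll_time 300
```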

Thanks again,
Leigh

On Thursday 16 August 2007 06:43:45 David Singleton wrote:
> Hi Leigh,
>
> I'm guessing the problem is the way torque MOMs trawl /proc on
> Linux causing the MOM to spend a lot of time very busy.  Have
> you strace'd the MOM? Is your MOM running in the same cpuset
> as jobs or otherwise fighting for cpu time?
>
> I haven't looked at this for a while but it used to be that the MOM
> would run through the whole of /proc multiple times per job (in
> cput_sum(), overcpu_proc(), mem_sum(), ....).  Reading /proc on a
> full 128P system is not cheap and if most of your jobs are single
> cpu then your multiplier of this number is large.  The cost is about
> 64^2 times what it is on a cluster of 2P nodes.
>
> There are multiple ways this can be improved.  On Solaris and
> Digital Unix, MOMs reuse a single trawl of /proc for multiple jobs.
> Since most uses are sampling for resource usage/limits, it doesn't
> matter if the sampled /proc data is a few millisecs stale.  [On top
> of this, the binary blob dump of /proc entries on Solaris and Tru64
> is noticeably *faster* than Linux's annoying "readable" format so
> MOM on Linux takes a double hit in comparison.]
>
> Another alternative is to use cpusets or SGI "process aggregates"
> (as used in SGI array sessions or their "job" module) to have the
> kernel accurately track the list of processes in a job (actually in
> each task of a job).  Then, for each job, MOM can just loop over the
> list of pids in these job "containers" task lists instead of the
> whole of /proc.  [Good news for Linux: a generic "process container"
> infrastructure is hopefully being added to the kernel. It effectively
> provides an indelible session id and a means of accessing job
> task lists.]
>
> There is benefit in applying both these modifications.
>
> I could be totally wrong about what is causing your system grief but
> it does sound like a MOM scalability issue. If so, fixing it will
> require some development work.
>
> Cheers,
> David
>
> Leigh Gordon wrote:
> > Hi everyone,
> >
> > (I'm not 100% sure on whether this is a problem with maui or torque, so
> > please tell me if the maui list would be more appropriate!)
> >
> > We've got an issue which has only occurred since increasing the number of
> > CPUs on our Altix 4700 to 128 (all in a single node).  When the queue is
> > full (usually when there are a lot of single-CPU jobs), there seems to be
> > a communications issue between maui/torque/MOM which results in the
> > system doing a couple of strange, annoying things:
> >
> > A) the node reports as down, with 0 processors active (which isn't the
> > case, as the running jobs continue running fine and the processors are
> > definitely active)
> >
> > whiteout:~ # showq
> > <joblist snipped>
> > 101 Active Jobs       0 of    0 Processors Active (0.00%)
> >
> > -------------------------------------------------------------------------
> >----------------------------------------------------- whiteout:~ #
> > checknode -v whiteout
> >
> >
> > checking node whiteout
> >
> > State:      Down  (in current state for 00:00:00)
> > Configured Resources: PROCS: 118  MEM: 305G  SWAP: 284G  DISK: 5020G
> > Utilized   Resources: PROCS: 118  DISK: 3783G
> > Dedicated  Resources: PROCS: 117  MEM: 70G
> > Opsys:         SLES9  Arch:        ia64
> > Speed:      1.00  Load:      118.180
> > Location:   Partition: DEFAULT  Frame/Slot:  1/1
> > Network:    [DEFAULT]
> > Features:   [batch]
> > Attributes: [Batch]
> > Classes:    [batch 1:118]
> >
> > Total Time: 73:04:30:05  Up: 71:06:58:29 (97.41%)  Active: 71:06:53:45
> > (97.40%)
> > <joblist snipped>
> > ALERT:  jobs active on node but state is Down
> > ALERT:  node is in state Down but load is high (118.180)
> >
> > -------------------------------------------------------------------------
> >------------------------------------------
> >
> > and B) jobs get placed in the "BatchHold" state and stay there. Tracing
> > the history of an example of one of the affected jobs reveals the
> > following:
> >
> > -------------------------------------------------------------------------
> >------------------------------------------ whiteout:~ # grep 12527
> > /var/spool/torque/*_logs/*
> > /var/spool/torque/mom_logs/20070224:02/24/2007 13:17:45;0008;
> > pbs_mom;Job;2994.whiteout.sf.utas.edu.au;kill_task: killing pid 12527
> > task 1 with sig 15
> > /var/spool/torque/server_logs/20070814:08/14/2007
> > 19:19:47;0100;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;enqueuing into
> > batch, state 1 hop 1
> > /var/spool/torque/server_logs/20070814:08/14/2007
> > 19:19:47;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Queued at
> > request of wpsijp at whiteout.sf.utas.edu.au, owner =
> > wpsijp at whiteout.sf.utas.edu.au, job name = M2.pbs, queue = batch
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified
> > at request of maui at whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at
> > request of maui at whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:40;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
> > whiteout failed error = 15020
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
> > whiteout failed error = 15020
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:53;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in
> > send_job, child timed-out attempting to start job
> > 12527.whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:53;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported
> > failure for job after 21 seconds (dest=whiteout), rc=10
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run
> > job, MOM rejected/timeout
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified
> > at request of maui at whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at
> > request of maui at whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:14;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
> > whiteout failed error = 15020
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to
> > whiteout failed error = 15020
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:27;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in
> > send_job, child timed-out attempting to start job
> > 12527.whiteout.sf.utas.edu.au
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:27;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported
> > failure for job after 21 seconds (dest=whiteout), rc=10
> > /var/spool/torque/server_logs/20070815:08/15/2007
> > 01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run
> > job, MOM rejected/timeout
> >
> >
> > -------------------------------------------------------------------------
> >----------------- whiteout:~ # checkjob -v 12527
> >
> >
> > checking job 12527 (RM job '12527.whiteout.sf.utas.edu.au')
> >
> > State: Idle
> > Creds:  user:wpsijp  group:users  class:batch  qos:DEFAULT
> > WallTime: 00:00:00 of 6:00:00
> > SubmitTime: Tue Aug 14 19:19:47
> >   (Time Queued  Total: 15:24:13  Eligible: 00:00:00)
> >
> > StartDate: -9:26:53  Wed Aug 15 01:17:07
> > Total Tasks: 1
> >
> > Req[0]  TaskCount: 1  Partition: ALL
> > Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> > Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> > Exec:  ''  ExecSize: 0  ImageSize: 0
> > Dedicated Resources Per Task: PROCS: 1  MEM: 500M
> > NodeAccess: SHARED
> > NodeCount: 1
> >
> >
> > IWD: [NONE]  Executable:  [NONE]
> > Bypass: 0  StartCount: 2
> > PartitionMask: [ALL]
> > SystemQueueTime: Wed Aug 15 01:22:52
> >
> > Flags:       RESTARTABLE
> >
> > Holds:    Batch  (hold reason:  NoResources)
> > Messages:  exceeds available partition procs
> > PE:  1.00  StartPriority:  1
> > cannot select job 12527 for partition DEFAULT (job hold active)
> >
> > -------------------------------------------------------------------------
> >----------------- whiteout:~ # tracejob -v 12527
> > /var/spool/torque/server_priv/accounting/20070815: No matching job
> > records located
> > /var/spool/torque/server_logs/20070815: Successfully located matching job
> > records
> > /var/spool/torque/mom_logs/20070815: No matching job records located
> > /var/spool/torque/sched_logs/20070815: No such file or directory
> >
> > Job: 12527.whiteout.sf.utas.edu.au
> >
> > 08/15/2007 01:11:32  S    Job Modified at request of
> > maui at whiteout.sf.utas.edu.au
> > 08/15/2007 01:11:32  S    Job Run at request of
> > maui at whiteout.sf.utas.edu.au 08/15/2007 01:11:40  S    send of job to
> > whiteout failed error = 15020 08/15/2007 01:11:53  S    send of job to
> > whiteout failed error = 15020 08/15/2007 01:11:53  S    child reported
> > failure for job after 21 seconds (dest=whiteout), rc=10
> > 08/15/2007 01:11:53  S    unable to run job, MOM rejected/timeout
> > 08/15/2007 01:17:06  S    Job Modified at request of
> > maui at whiteout.sf.utas.edu.au
> > 08/15/2007 01:17:06  S    Job Run at request of
> > maui at whiteout.sf.utas.edu.au 08/15/2007 01:17:14  S    send of job to
> > whiteout failed error = 15020 08/15/2007 01:17:27  S    send of job to
> > whiteout failed error = 15020 08/15/2007 01:17:27  S    child reported
> > failure for job after 21 seconds (dest=whiteout), rc=10
> > 08/15/2007 01:17:27  S    unable to run job, MOM rejected/timeout
> > -------------------------------------------------------------------------
> >-----------------
> >
> > It does this periodically during the day (very frequently, in fact, so
> > I'm not sure whether it's linked to a scheduled event that runs every
> > few minutes?). The node returns to a normal "running" state and the CPUs
> > report as being used, but it inevitably happens again.
> >
> > Can anyone shed some light on where this problem might be and what can be
> > done to resolve it? I can provide more information if required!
> >
> > The main issue it causes is having to manually release blocked jobs
> > when there are available CPUs (after which they run fine), but obviously
> > it should be able to manage the queue automatically without this
> > intervention! Thanks
> >
> > Regards,
> >
> > Leigh Gordon
> > High Performance Computing Systems Administrator
> > IT Resources, University of Tasmania
> > Phone: 03 6226 6389
> > http://www.tpac.org.au
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers



-- 
Regards,

Leigh Gordon
High Performance Computing Systems Administrator
IT Resources, University of Tasmania
Phone: 03 6226 6389
http://www.tpac.org.au

"What we're really after is simply that people acquire a legal license for 
Windows for each computer they own before they move on to Linux or Sun 
Solaris or BSD or OS/2 or whatever."
- Bill Gates -

