[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely

David Singleton David.Singleton at anu.edu.au
Wed Aug 15 14:43:45 MDT 2007


Hi Leigh,

I'm guessing the problem is the way Torque MOMs trawl /proc on
Linux, which can keep the MOM busy for long stretches.  Have
you strace'd the MOM?  Is your MOM running in the same cpuset
as jobs, or otherwise fighting for cpu time?

I haven't looked at this for a while, but it used to be that the MOM
would run through the whole of /proc multiple times per job (in
cput_sum(), overcpu_proc(), mem_sum(), ...).  Reading /proc on a
full 128P system is not cheap, and if most of your jobs are single
cpu then that cost is multiplied by a large number of jobs: each scan
reads roughly 64 times as many processes as on a 2P node, and there
are roughly 64 times as many jobs per node doing the scanning, so the
per-node cost is about 64^2 times what it is on a cluster of 2P nodes.
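
To make that concrete, the per-job scan follows roughly the pattern
below.  This is only an illustration of the pattern, not Torque's
actual code (cput_sum() in the Linux MOM is the real thing, and it is
more careful than this); the point is that every polled job pays for a
complete walk of /proc:

/*
 * Sketch of the per-job /proc scan pattern described above: for every
 * polled job, walk all of /proc and pick out the processes whose
 * session id belongs to that job.  Not Torque's actual code.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Sum cpu ticks (utime + stime) of all processes in session 'sid'. */
static unsigned long cput_for_session(int sid)
{
    unsigned long total = 0;
    DIR *proc = opendir("/proc");
    struct dirent *de;

    if (proc == NULL)
        return 0;

    while ((de = readdir(proc)) != NULL) {
        char path[64], buf[512];
        FILE *f;
        int session;
        unsigned long utime, stime;

        if (de->d_name[0] < '0' || de->d_name[0] > '9')
            continue;                    /* not a pid directory */

        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;                    /* process exited meanwhile */

        /* Fields 6, 14 and 15 of /proc/<pid>/stat are session, utime
         * and stime.  The command name is skipped by finding its
         * closing ')'; real code has to be just as careful. */
        if (fgets(buf, sizeof(buf), f) != NULL) {
            char *p = strrchr(buf, ')');
            if (p != NULL &&
                sscanf(p + 1, " %*c %*d %*d %d %*d %*d %*u %*u %*u"
                              " %*u %*u %lu %lu",
                       &session, &utime, &stime) == 3 &&
                session == sid)
                total += utime + stime;
        }
        fclose(f);
    }
    closedir(proc);
    return total;    /* clock ticks; divide by sysconf(_SC_CLK_TCK) */
}

Every one of those opendir/fopen/parse cycles is work the MOM does
while your jobs are also trying to run.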

There are multiple ways this can be improved.  On Solaris and
Digital UNIX, the MOM reuses a single trawl of /proc for multiple jobs.
Since most uses are sampling for resource usage/limits, it doesn't
matter if the sampled /proc data is a few milliseconds stale.  [On top
of this, the binary blob dump of /proc entries on Solaris and Tru64
is noticeably *faster* to read than Linux's annoying "readable" text
format, so MOM on Linux takes a double hit in comparison.]
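
A rough sketch of that idea (hypothetical code, not what the Solaris
or Tru64 MOMs literally do): one walk over /proc fills a table stamped
with the time it was taken, and every job's accounting reads the
table, refreshing it only when it is older than some small threshold:

/*
 * Sketch of "one trawl, many jobs": a cached snapshot of /proc that is
 * rebuilt at most once per MAX_AGE_MS, however many jobs are polled.
 * Samples can be a few milliseconds stale, which is fine for
 * usage/limit polling.  Names and layout here are made up.
 */
#include <stddef.h>
#include <sys/time.h>

#define MAX_AGE_MS 100

struct proc_sample {
    int           pid;
    int           session;
    unsigned long cput_ticks;    /* utime + stime */
    unsigned long rss_pages;
};

static struct proc_sample *table;       /* filled by scan_proc_once() */
static size_t              table_len;
static struct timeval      table_time;  /* when the table was filled  */

static long ms_since(const struct timeval *then)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_sec - then->tv_sec) * 1000L +
           (now.tv_usec - then->tv_usec) / 1000L;
}

static void scan_proc_once(void)
{
    /* Stubbed for this sketch: a real version would do one readdir()
     * pass over /proc, as in the previous example, and store pid,
     * session, cput and rss for every process it finds. */
    gettimeofday(&table_time, NULL);
}

/* All per-job accounting goes through here; /proc is walked at most
 * once per MAX_AGE_MS no matter how many jobs ask. */
static unsigned long cput_for_session_cached(int sid)
{
    unsigned long total = 0;
    size_t i;

    if (table == NULL || ms_since(&table_time) > MAX_AGE_MS)
        scan_proc_once();

    for (i = 0; i < table_len; i++)
        if (table[i].session == sid)
            total += table[i].cput_ticks;

    return total;
}

The staleness window only has to stay small relative to the polling
interval, so the accounting numbers are effectively unchanged.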

Another alternative is to use cpusets or SGI "process aggregates"
(as used in SGI array sessions or their "job" module) to have the
kernel accurately track the list of processes in a job (actually in
each task of a job).  Then, for each job, the MOM can just loop over
the pids in these job containers' task lists instead of the whole of
/proc.  [Good news for Linux: a generic "process container"
infrastructure is hopefully being added to the kernel.  It effectively
provides an indelible session id and a means of accessing job
task lists.]
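
Something like the sketch below is what I have in mind.  The
/dev/cpuset mount point and the torque/<jobid> cpuset naming are
assumptions for illustration, not something stock Torque sets up; the
point is that the kernel keeps the pid list, so the MOM opens a
handful of /proc entries per job instead of walking all of /proc:

/*
 * Sketch: sum cpu ticks for exactly the pids the kernel lists in a
 * job's cpuset.  The cpuset path is hypothetical.
 */
#include <stdio.h>
#include <string.h>

static unsigned long cput_for_cpuset(const char *jobid)
{
    char path[256];
    unsigned long total = 0;
    int pid;
    FILE *tasks;

    snprintf(path, sizeof(path), "/dev/cpuset/torque/%s/tasks", jobid);
    if ((tasks = fopen(path, "r")) == NULL)
        return 0;                      /* no cpuset for this job */

    while (fscanf(tasks, "%d", &pid) == 1) {
        char stat_path[64], buf[512];
        FILE *f;
        unsigned long utime, stime;

        snprintf(stat_path, sizeof(stat_path), "/proc/%d/stat", pid);
        if ((f = fopen(stat_path, "r")) == NULL)
            continue;                  /* task exited since listing */

        /* utime and stime are fields 14 and 15 of /proc/<pid>/stat. */
        if (fgets(buf, sizeof(buf), f) != NULL) {
            char *p = strrchr(buf, ')');
            if (p != NULL &&
                sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u"
                              " %*u %*u %lu %lu", &utime, &stime) == 2)
                total += utime + stime;
        }
        fclose(f);
    }
    fclose(tasks);
    return total;
}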

There is benefit in applying both these modifications.

I could be totally wrong about what is causing your system grief, but
it does sound like a MOM scalability issue.  If so, fixing it will
require some development work.

Cheers,
David


Leigh Gordon wrote:
> Hi everyone,
> 
> (I'm not 100% sure whether this is a problem with maui or torque, so please 
> tell me if the maui list would be more appropriate!)
> 
> We've got an issue which has only occurred since increasing the number of CPUs 
> on our Altix 4700 to 128 (all in a single node).  When the queue is 
> full (usually when there are a lot of single-CPU jobs), there seems to be a 
> communications issue between maui/torque/MOM which results in the system doing 
> a couple of strange and annoying things:
> 
> A) node reports as down, with 0 processors active (which isn't the case, as the 
> running jobs continue running fine and the processors are definitely active)
> 
> whiteout:~ # showq
> <joblist snipped>
> 101 Active Jobs       0 of    0 Processors Active (0.00%)
> 
> ------------------------------------------------------------------------------------------------------------------------------
> whiteout:~ # checknode -v whiteout
> 
> 
> checking node whiteout
> 
> State:      Down  (in current state for 00:00:00)
> Configured Resources: PROCS: 118  MEM: 305G  SWAP: 284G  DISK: 5020G
> Utilized   Resources: PROCS: 118  DISK: 3783G
> Dedicated  Resources: PROCS: 117  MEM: 70G
> Opsys:         SLES9  Arch:        ia64
> Speed:      1.00  Load:      118.180
> Location:   Partition: DEFAULT  Frame/Slot:  1/1
> Network:    [DEFAULT]
> Features:   [batch]
> Attributes: [Batch]
> Classes:    [batch 1:118]
> 
> Total Time: 73:04:30:05  Up: 71:06:58:29 (97.41%)  Active: 71:06:53:45 
> (97.40%)
> <joblist snipped>
> ALERT:  jobs active on node but state is Down
> ALERT:  node is in state Down but load is high (118.180)
> 
> -------------------------------------------------------------------------------------------------------------------
> 
> and B) jobs get placed in the "BatchHold" state and stay there. Tracing the 
> history of an example of one of the affected jobs reveals the following:
> 
> -------------------------------------------------------------------------------------------------------------------
> whiteout:~ # grep 12527 /var/spool/torque/*_logs/*
> /var/spool/torque/mom_logs/20070224:02/24/2007 13:17:45;0008;   
> pbs_mom;Job;2994.whiteout.sf.utas.edu.au;kill_task: killing pid 12527 task 1 
> with sig 15
> /var/spool/torque/server_logs/20070814:08/14/2007 
> 19:19:47;0100;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;enqueuing into 
> batch, state 1 hop 1
> /var/spool/torque/server_logs/20070814:08/14/2007 
> 19:19:47;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Queued at 
> request of wpsijp at whiteout.sf.utas.edu.au, owner = 
> wpsijp at whiteout.sf.utas.edu.au, job name = M2.pbs, queue = batch
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at 
> request of maui at whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:32;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request 
> of maui at whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:40;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
> whiteout failed error = 15020
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
> whiteout failed error = 15020
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:53;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in 
> send_job, child timed-out attempting to start job 
> 12527.whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:53;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported 
> failure for job after 21 seconds (dest=whiteout), rc=10
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:11:53;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job, 
> MOM rejected/timeout
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Modified at 
> request of maui at whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:06;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;Job Run at request 
> of maui at whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:14;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
> whiteout failed error = 15020
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;send of job to 
> whiteout failed error = 15020
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:27;0001;PBS_Server;Svr;PBS_Server;Expired credential (15020) in 
> send_job, child timed-out attempting to start job 
> 12527.whiteout.sf.utas.edu.au
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:27;0002;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;child reported 
> failure for job after 21 seconds (dest=whiteout), rc=10
> /var/spool/torque/server_logs/20070815:08/15/2007 
> 01:17:27;0008;PBS_Server;Job;12527.whiteout.sf.utas.edu.au;unable to run job, 
> MOM rejected/timeout
> 
> 
> ------------------------------------------------------------------------------------------
> whiteout:~ # checkjob -v 12527
> 
> 
> checking job 12527 (RM job '12527.whiteout.sf.utas.edu.au')
> 
> State: Idle
> Creds:  user:wpsijp  group:users  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 6:00:00
> SubmitTime: Tue Aug 14 19:19:47
>   (Time Queued  Total: 15:24:13  Eligible: 00:00:00)
> 
> StartDate: -9:26:53  Wed Aug 15 01:17:07
> Total Tasks: 1
> 
> Req[0]  TaskCount: 1  Partition: ALL
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> Exec:  ''  ExecSize: 0  ImageSize: 0
> Dedicated Resources Per Task: PROCS: 1  MEM: 500M
> NodeAccess: SHARED
> NodeCount: 1
> 
> 
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 2
> PartitionMask: [ALL]
> SystemQueueTime: Wed Aug 15 01:22:52
> 
> Flags:       RESTARTABLE
> 
> Holds:    Batch  (hold reason:  NoResources)
> Messages:  exceeds available partition procs
> PE:  1.00  StartPriority:  1
> cannot select job 12527 for partition DEFAULT (job hold active)
> 
> ------------------------------------------------------------------------------------------
> whiteout:~ # tracejob -v 12527
> /var/spool/torque/server_priv/accounting/20070815: No matching job records 
> located
> /var/spool/torque/server_logs/20070815: Successfully located matching job 
> records
> /var/spool/torque/mom_logs/20070815: No matching job records located
> /var/spool/torque/sched_logs/20070815: No such file or directory
> 
> Job: 12527.whiteout.sf.utas.edu.au
> 
> 08/15/2007 01:11:32  S    Job Modified at request of 
> maui at whiteout.sf.utas.edu.au
> 08/15/2007 01:11:32  S    Job Run at request of maui at whiteout.sf.utas.edu.au
> 08/15/2007 01:11:40  S    send of job to whiteout failed error = 15020
> 08/15/2007 01:11:53  S    send of job to whiteout failed error = 15020
> 08/15/2007 01:11:53  S    child reported failure for job after 21 seconds 
> (dest=whiteout), rc=10
> 08/15/2007 01:11:53  S    unable to run job, MOM rejected/timeout
> 08/15/2007 01:17:06  S    Job Modified at request of 
> maui at whiteout.sf.utas.edu.au
> 08/15/2007 01:17:06  S    Job Run at request of maui at whiteout.sf.utas.edu.au
> 08/15/2007 01:17:14  S    send of job to whiteout failed error = 15020
> 08/15/2007 01:17:27  S    send of job to whiteout failed error = 15020
> 08/15/2007 01:17:27  S    child reported failure for job after 21 seconds 
> (dest=whiteout), rc=10
> 08/15/2007 01:17:27  S    unable to run job, MOM rejected/timeout
> ------------------------------------------------------------------------------------------
> 
> It does this periodically during the day (very frequently, in fact, so I'm not 
> sure whether it's linked to a scheduled event that runs every few minutes?). 
> The node returns to a normal "running" state and the CPUs report as being 
> used, but it inevitably happens again shortly afterwards.
> 
> Can anyone shed some light on where this problem might be and what can be done 
> to resolve it? I can provide more information if required!
> 
> The main issue it causes is having to manually release blocked jobs when 
> there are available CPUs (the jobs then run fine), but obviously it should be 
> able to manage the queue automatically without resorting to this manual 
> intervention! Thanks
> 
> Regards,
> 
> Leigh Gordon
> High Performance Computing Systems Administrator
> IT Resources, University of Tasmania
> Phone: 03 6226 6389
> http://www.tpac.org.au
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers


