[torqueusers] Torque behaving strangely, jobs becoming blocked indefinitely

Leigh Gordon Leigh.Gordon at utas.edu.au
Wed Aug 22 22:13:21 MDT 2007


Hi everyone,

This problem has just manifested itself again, in a different way. Users' jobs 
are now being queued and then killed, rather than sitting in the BLOCKED 
queue as before. 

"lwgordon at whiteout:~> qsub -I -lncpus=5
qsub: waiting for job 14093.whiteout.sf.utas.edu.au to start
qsub: job 14093.whiteout.sf.utas.edu.au ready

error: wanted 5 cpus but only 1 cpus are available.  Error in PBS setup?

qsub: job 14093.whiteout.sf.utas.edu.au completed"

However, showq reports: "68 Active Jobs     113 of  118 Processors Active 
(95.76%)"

Single-CPU jobs seem to enter the running queue fine, but 4-CPU jobs enter 
the running queue and then die shortly afterwards.

"whiteout:/var/spool/maui/log # tracejob 14090
/var/spool/torque/sched_logs/20070823: No such file or directory

Job: 14090.whiteout.sf.utas.edu.au

08/23/2007 12:54:52  S    enqueuing into batch, state 1 hop 1
08/23/2007 12:54:52  S    Job Queued at request of 
rijamos@whiteout.sf.utas.edu.au, owner = rijamos@whiteout.sf.utas.edu.au, job 
name = mpmnitror, queue = batch
08/23/2007 12:54:52  A    queue=batch
08/23/2007 12:54:53  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 12:54:53  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 12:55:04  S    send of job to whiteout failed error = 15020
08/23/2007 12:55:14  S    send of job to whiteout failed error = 15020
08/23/2007 12:55:14  S    child reported failure for job after 21 seconds 
(dest=whiteout), rc=10
08/23/2007 12:55:14  S    unable to run job, MOM rejected/timeout
08/23/2007 13:01:43  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 13:01:43  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 13:01:51  S    send of job to whiteout failed error = 15020
08/23/2007 13:02:01  S    send of job to whiteout failed error = 15020
08/23/2007 13:02:01  S    child reported failure for job after 18 seconds 
(dest=whiteout), rc=10
08/23/2007 13:02:01  S    unable to run job, MOM rejected/timeout
08/23/2007 13:17:13  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 13:17:13  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 13:17:24  S    send of job to whiteout failed error = 15020
08/23/2007 13:17:34  S    send of job to whiteout failed error = 15020
08/23/2007 13:17:34  S    child reported failure for job after 21 seconds 
(dest=whiteout), rc=10
08/23/2007 13:17:34  S    unable to run job, MOM rejected/timeout
08/23/2007 13:23:15  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 13:23:15  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 13:23:23  S    send of job to whiteout failed error = 15020
08/23/2007 13:23:36  S    send of job to whiteout failed error = 15020
08/23/2007 13:23:36  S    child reported failure for job after 21 seconds 
(dest=whiteout), rc=10
08/23/2007 13:23:36  S    unable to run job, MOM rejected/timeout
08/23/2007 13:29:34  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 13:29:34  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 13:29:45  S    send of job to whiteout failed error = 15020
08/23/2007 13:29:58  S    send of job to whiteout failed error = 15020
08/23/2007 13:29:58  S    child reported failure for job after 24 seconds 
(dest=whiteout), rc=10
08/23/2007 13:29:58  S    unable to run job, MOM rejected/timeout
08/23/2007 13:35:02  S    Job Modified at request of 
maui@whiteout.sf.utas.edu.au
08/23/2007 13:35:02  S    Job Run at request of maui@whiteout.sf.utas.edu.au
08/23/2007 13:35:07  S    child reported success for job after 5 seconds 
(dest=whiteout), rc=0
08/23/2007 13:35:07  A    user=rijamos group=users jobname=mpmnitror 
queue=batch ctime=1187837692 qtime=1187837692 etime=1187837692 
start=1187840107 exec_host=whiteout Resource_List.mem=6700mb
                          Resource_List.ncpus=4 
Resource_List.neednodes=whiteout Resource_List.nodect=1 
Resource_List.walltime=400:00:00
08/23/2007 13:35:08  S    Exit_status=-2
08/23/2007 13:35:08  A    user=rijamos group=users jobname=mpmnitror 
queue=batch ctime=1187837692 qtime=1187837692 etime=1187837692 
start=1187840107 exec_host=whiteout Resource_List.mem=6700mb
                          Resource_List.ncpus=4 
Resource_List.neednodes=whiteout Resource_List.nodect=1 
Resource_List.walltime=400:00:00 session=0 end=1187840108 Exit_status=-2
08/23/2007 13:36:27  M    forking to user, uid: 55197  gid: 100  
homedir: '/u/rijamos'
08/23/2007 13:36:35  S    dequeuing from batch, state COMPLETE"


If I run strace on the MOM, it does this for a LONG time, looping over every 
process in the system (about 1300 of them) many times over (once for each 
job in the queue?):

"open("/proc/8181/stat", O_RDONLY)       = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2000000002900000
read(12, "8181 (UVic_ESCM) R 8177 8097 809"..., 1024) = 267
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12)                               = 0
munmap(0x2000000002900000, 65536)       = 0
open("/proc/8457/stat", O_RDONLY)       = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2000000002900000
read(12, "8457 (bash) S 16979 8457 8457 0 "..., 1024) = 265
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12)                               = 0
munmap(0x2000000002900000, 65536)       = 0
open("/proc/8468/stat", O_RDONLY)       = 12
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2000000002900000
read(12, "8468 (bash) S 16979 8468 8468 0 "..., 1024) = 266
fstat(12, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
close(12)                               = 0
munmap(0x2000000002900000, 65536)       = 0"

Even when this finishes and the strace output returns to a sane rate (just 
reading /proc/loadavg every now and then), the node reports as down for a 
while, then comes back up, only to be taken down by the MOM again!
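
To put numbers on it: with roughly 1300 processes and 68 running jobs, one 
pass per job is on the order of 90,000 opens of /proc/<pid>/stat per poll 
cycle. The pattern in the strace boils down to something like the following 
(a rough sketch of the general technique only, not the MOM's actual 
cput_sum() code; the stat parsing is simplified and breaks if a command name 
contains spaces):

/* Rough sketch of the per-job /proc walk the strace shows.  Not
 * Torque's actual cput_sum(); just an illustration of why the cost
 * is proportional to the total process count for EVERY job polled. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>

unsigned long cput_for_session(pid_t session)
{
    unsigned long total = 0;
    struct dirent *de;
    DIR *proc = opendir("/proc");

    if (proc == NULL)
        return 0;

    while ((de = readdir(proc)) != NULL) {
        char path[64];
        FILE *f;
        pid_t sid;
        unsigned long utime, stime;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                   /* not a pid directory */

        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;                   /* process exited meanwhile */

        /* stat fields 1-15: pid (comm) state ppid pgrp session tty
         * tpgid flags minflt cminflt majflt cmajflt utime stime */
        if (fscanf(f, "%*d %*s %*c %*d %*d %d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   &sid, &utime, &stime) == 3 && sid == session)
            total += utime + stime;     /* in clock ticks */
        fclose(f);
    }
    closedir(proc);
    return total;
}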

Is there ANY way to stop the system from slowing to a crawl and killing jobs 
as soon as they are submitted? Even if people's jobs were queued and checked 
less frequently, that would be preferable to jobs being killed for no reason 
and users having to resubmit them hoping they will work this time! 
Frustrating, as you can imagine :)

It seems that even when there are sufficient CPUs free, by the time the 
scheduler gets around to running a job, the MOM is back to its "/proc 
trawling" again, so the two can't communicate and the job either starts and 
then dies, or remains in a deferred state.
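
(For anyone digging into one of these, Maui's checkjob should report the 
defer state and the reason it recorded for a given job, e.g.

  checkjob 14090

though I haven't captured that output here.)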

Any ideas? I realise a proper fix involves rewriting the program for better 
scalability, but does anyone have suggestions to alleviate the problem in 
the meantime? Perhaps there are values in the MOM/maui/pbs_server configs 
that can be tweaked to lighten the load? 
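
To be concrete about the sort of knobs I mean, something along these lines 
is what I'm imagining (parameter names as I understand them from the 
pbs_mom and maui.cfg documentation; the values are guesses on my part, not 
settings I've verified on whiteout):

  # mom_priv/config: poll jobs and push node status less often
  $check_poll_time     300
  $status_update_time  120

  # maui.cfg: have the scheduler query the resource manager less often
  RMPOLLINTERVAL       00:05:00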

I think you were right on the money, David: this /proc business is a bit 
ridiculous, and it's ironic that a machine with this much grunt can be 
slowed to a crawl by a single process!
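
If anyone finds time to hack on the MOM before a proper fix appears 
upstream, the first of your suggestions below (one /proc trawl shared by 
all jobs) looks like the easy win. Something shaped like this is what I 
have in mind (a rough sketch of the idea only, not a patch against the real 
MOM source; the cache size and refresh interval are arbitrary):

/* Sketch of caching one /proc trawl per poll cycle and answering
 * every job's usage query from the cache.  Slightly stale samples
 * are fine for resource-usage/limit checks.  Illustration only. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>

#define MAXSAMPLES 8192

static struct { pid_t session; unsigned long cput; } cache[MAXSAMPLES];
static int    ncached;
static time_t stamp;

static void refresh_cache(void)
{
    struct dirent *de;
    DIR *proc = opendir("/proc");

    ncached = 0;
    if (proc == NULL)
        return;
    while ((de = readdir(proc)) != NULL && ncached < MAXSAMPLES) {
        char path[64];
        FILE *f;
        pid_t sid;
        unsigned long ut, st;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        if ((f = fopen(path, "r")) == NULL)
            continue;
        if (fscanf(f, "%*d %*s %*c %*d %*d %d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   &sid, &ut, &st) == 3) {
            cache[ncached].session = sid;
            cache[ncached].cput    = ut + st;
            ncached++;
        }
        fclose(f);
    }
    closedir(proc);
}

unsigned long cached_cput_for_session(pid_t session)
{
    unsigned long total = 0;
    time_t now = time(NULL);
    int i;

    if (now - stamp >= 5) {     /* ONE trawl, shared by all jobs */
        refresh_cache();
        stamp = now;
    }
    for (i = 0; i < ncached; i++)
        if (cache[i].session == session)
            total += cache[i].cput;
    return total;               /* clock ticks, as before */
}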

Thanks
Leigh

-- 
Regards,

Leigh Gordon
High Performance Computing Systems Administrator
IT Resources, University of Tasmania
Phone: 03 6226 6389
http://www.tpac.org.au

"What we're really after is simply that people acquire a legal license for 
Windows for each computer they own before they move on to Linux or Sun 
Solaris or BSD or OS/2 or whatever."
- Bill Gates -

On Mon, 20 Aug 2007 08:54:55 am Leigh Gordon wrote:
> Hi, thanks for the reply.
>
> I've looked into it a bit more (even going as far as looking at the source
> code for 'top', which unfortunately also uses /proc rather than system
> calls) and it seems like a job for someone with a lot more C experience
> than myself. I did find a GPL project called Supermon, which I read a
> paper on (it specifically said it used system calls rather than /proc on
> Linux), which may hold some clues on how to do it.
>
> Decreasing the MOM polling interval to 5 minutes (it was 3 minutes) seems
> at least to be preventing jobs from becoming blocked for good (jobs that
> get marked "Deferred" return to "Idle" status rather than "BatchHold"),
> although it doesn't stop the system from having its little fits every few
> minutes.
>
> Thanks again,
> Leigh
>
> On Thursday 16 August 2007 06:43:45 David Singleton wrote:
> > Hi Leigh,
> >
> > I'm guessing the problem is the way Torque MOMs trawl /proc on
> > Linux, which leaves the MOM spending a lot of time very busy.  Have
> > you strace'd the MOM? Is your MOM running in the same cpuset
> > as jobs, or otherwise fighting for cpu time?
> >
> > I haven't looked at this for a while but it used to be that the MOM
> > would run through the whole of /proc multiple times per job (in
> > cput_sum(), overcpu_proc(), mem_sum(), ....).  Reading /proc on a
> > full 128P system is not cheap and if most of your jobs are single
> > cpu then your multiplier of this number is large.  The cost is about
> > 64^2 times what it is on a cluster of 2P nodes.
> >
> > There are multiple ways this can be improved.  On Solaris and
> > Digital Unix, MOMs reuse a single trawl of /proc for multiple jobs.
> > Since most uses are sampling for resource usage/limits, it doesn't
> > matter if the sampled /proc data is a few millisecs stale.  [On top
> > of this, the binary blob dump of /proc entries on Solaris and Tru64
> > is noticeably *faster* than Linux's annoying "readable" format so
> > MOM on Linux takes a double hit in comparison.]
> >
> > Another alternative is to use cpusets or SGI "process aggregates"
> > (as used in SGI array sessions or their "job" module) to have the
> > kernel accurately track the list of processes in a job (actually in
> > each task of a job).  Then, for each job, MOM can just loop over the
> > list of pids in these job "containers" task lists instead of the
> > whole of /proc.  [Good news for Linux: a generic "process container"
> > infrastructure is hopefully being added to the kernel. It effectively
> > provides an indelible session id and a means of accessing job
> > task lists.]
> >
> > There is benefit in applying both these modifications.
> >
> > I could be totally wrong about what is causing your system grief but
> > it does sound like a MOM scalability issue. If so, fixing it will
> > require some development work.
> >
> > Cheers,
> > David




