[torqueusers] 100+ job launch failures - 15009 errors.

Garrick Staples garrick at usc.edu
Tue Nov 6 15:35:37 MST 2007


On Tue, Nov 06, 2007 at 02:52:27PM -0600, Amitoj G Singh alleged:
> Introduction
> ============
> o Torque version: 2.1.6
> o Maui version: 3.2.6p18
> o Server kernel version: 2.6.9-55.0.2.ELsmp
> o Worker kernel version: 2.6.21
> o 600-node AMD dual-core dual-socket cluster, 4GB memory per worker node,
> fast Ethernet.
> o 520-node Intel P4 single-core cluster, 1GB memory per worker node, fast
> Ethernet.
> 
> Problem
> ========
> 
> The server serves both the 600-node and the 520-node cluster; each cluster
> has its own queue. When submitting a 100+ node job to either queue, the
> job goes from Q -> R -> Q, and the following messages appear on the job's
> head MOM node and on the other worker MOM nodes:
> 
> pbs_mom;Svr;pbs_mom;Job with requested ID already exists (15009) in
> im_request, KILL/ABORT request for job 202902 returned error

This means pbs_mom is timing out and requests are being resent; the resent
request reaches a MOM that already has the job, which is what produces the
15009 "job already exists" error.

There is a latency issue somewhere.  It could be the OS, the filesystems,
the network, or something blocking inside pbs_mom.
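
One way to see where the time is going is to look at the MOM logs on the job's
head node around a failed start.  A sketch, assuming the /var/spool/PBS
server-home from the configure line quoted below (the log file is named by
date):

  # pull out the resend/abort entries for the failed job
  grep -n 'im_request\|already exists' /var/spool/PBS/mom_logs/20071106

Large gaps between the timestamps on those entries and the entries just before
them show how long the requests are stalling.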

Can we see MOM's config and the server config from qmgr?
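
Something along these lines would capture both (again assuming the
/var/spool/PBS server-home; adjust to your install):

  # server and queue settings
  qmgr -c 'print server'

  # MOM config on a worker node
  cat /var/spool/PBS/mom_priv/config

  # MOM diagnostics, run on a worker node
  momctl -d 3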

 
> Running "qrun <jobid>" several times finally kicks the job off. We
> configured Torque 2.1.6 with the following:
> 
> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
> --with-default-server=foobar --with-rcp=/usr/bin/rcp
> 
> What started this
> =================
> Everything was working fine until we upgraded the worker kernels from
> 2.6.12.6 to 2.6.21.

So go back to 2.6.12.6?

 
> Our diagnosis attempts
> ======================
> Building torque 2.1.9 with ...
> 
> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp

Don't use --disable-rpp.  It has no effect on your reported problem and is just
a bad idea.  I don't know where people keep getting this from.


> We used our old 2.1.6 server settings. With torque 2.1.9 as configured
> above, we would get only 1 node per job: whether we requested 1 node or
> 300 nodes, torque would only assign 1 node per job.
> 
> Any help in this regard would be much appreciated.

So you've got 2 problems?  The latency issue that came when you upgraded the
kernel, and the scheduling problem.  The latter is probably because of a
resources_default.ncpus setting on the server or queue.

Note that if you are using Maui, then torque isn't assigning nodes.  You can use
'checkjob' to see what Maui is assigning.
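
A sketch of what to look for (the queue name 'batch' is a placeholder for your
real queue names):

  # look for an ncpus/nodes default on the server or on the queues
  qmgr -c 'print server' | grep resources_default
  qmgr -c 'print queue batch' | grep resources_default

  # if a stray ncpus default is set on a queue, drop it
  qmgr -c 'unset queue batch resources_default.ncpus'

  # ask Maui what it actually allocated for a job
  checkjob 202902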


