[torqueusers] 100+ job launch failures - 15009 errors.
Garrick Staples
garrick at usc.edu
Tue Nov 6 15:35:37 MST 2007
On Tue, Nov 06, 2007 at 02:52:27PM -0600, Amitoj G Singh alleged:
> Introduction
> ============
> o Torque version: 2.1.6
> o Maui version: 3.2.6p18
> o Server kernel version: 2.6.9-55.0.2.ELsmp
> o Worker kernel version: 2.6.21
> o 600-node AMD dual-core dual-socket cluster, 4GB memory per worker node,
> fast Ethernet.
> o 520-node Intel P4 single-core cluster, 1GB memory per worker node, fast
> Ethernet.
>
> Problem
> ========
>
> The server serves both the 600-node and the 520-node cluster, each cluster
> has a different queue. When submitting a 100+ node job to either queue,
> the job goes from Q -> R -> Q and the following messages appear on the job
> worker-mom-head-node and mom-worker-nodes ..
>
> pbs_mom;Svr;pbs_mom;Job with requested ID already exists (15009) in
> im_request, KILL/ABORT request for job 202902 returned error
This means pbs_mom is timing out and requests are being resent, so the
resent request finds the job already there. There is a latency issue
somewhere. It could be in the OS, the filesystems, the network, or blocking
inside pbs_mom.
Can we see MOM's config and the server config from qmgr?
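As a rough sketch of how to collect both, assuming the TORQUE home is the
/var/spool/PBS path from the --with-server-home option quoted in this thread
(the function names here are made up for illustration, and reading mom_priv
may need root):

```shell
#!/bin/sh
# Sketch only: print the two configs requested above.
# TORQUE_HOME is an assumption taken from --with-server-home in this thread.
TORQUE_HOME="${TORQUE_HOME:-/var/spool/PBS}"

# Hypothetical helper: print the MOM config found under a TORQUE home.
show_mom_config() {
    # pbs_mom reads its settings from mom_priv/config under the TORQUE home.
    cfg="$1/mom_priv/config"
    if [ -r "$cfg" ]; then
        cat "$cfg"
    else
        echo "no readable MOM config at $cfg" >&2
        return 1
    fi
}

# Hypothetical helper: dump the server and queue settings via qmgr.
show_server_config() {
    if command -v qmgr >/dev/null 2>&1; then
        # 'print server' writes the settings out as qmgr commands.
        qmgr -c 'print server'
    else
        echo "qmgr not found in PATH" >&2
        return 1
    fi
}

# On a worker node:     show_mom_config "$TORQUE_HOME"
# On the server host:   show_server_config
```

The output of both commands is plain text, so it can be pasted straight into
a reply.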
> Doing several "qrun <jobid>" kicks the job off finally. We configured
> Torque 2.1.6 with the following:
>
> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
> --with-default-server=foobar --with-rcp=/usr/bin/rcp
>
> What started this
> =================
> Everything was working fine until we upgraded the worker kernels from
> 2.6.12.6 to 2.6.21.
So go back to 2.6.12.6?
> Our diagnosis attempts
> ======================
> Building torque 2.1.9 with ...
>
> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
Don't use --disable-rpp. It has no effect on your reported problem and is just
a bad idea. I don't know where people keep getting this from.
> We used our old 2.1.6 server settings. With torque 2.1.9 as configured
> above, when requesting nodes we would get only 1 node per job, we could
> request 1 or 300 nodes, torque would only assign 1 node per job.
>
> Any help in this regard would be much appreciated.
So you've got 2 problems? The latency issues that came when you upgraded the
kernel, and the scheduling problems. The latter is probably because of a
resources_default.ncpus setting.
Note that if you are using Maui, then torque isn't assigning nodes. You can use
'checkjob' to see what Maui is assigning.
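A minimal sketch of both checks, assuming the qmgr attribute is spelled
resources_default.ncpus and that 'print server' writes one "set ..." line per
setting (the ncpus_defaults helper name is made up for illustration):

```shell
#!/bin/sh
# Sketch only: pick out any ncpus defaults from qmgr output on stdin.
# Hypothetical helper; exits non-zero when no such setting is present.
ncpus_defaults() {
    grep -E 'resources_default\.ncpus'
}

# On the server host, list server- and queue-level ncpus defaults:
#   qmgr -c 'print server' | ncpus_defaults
#
# To see what Maui allocated to a job (job ID taken from this thread):
#   checkjob 202902
```

If a queue does turn out to carry such a default, something like
"qmgr -c 'unset queue <queue> resources_default.ncpus'" should clear it, with
<queue> standing in for the real queue name.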