[torqueusers] 100+ job launch failures - 15009 errors.

Amitoj G Singh amitoj at cs.uh.edu
Tue Nov 6 13:52:27 MST 2007


Introduction
============
o Torque version: 2.1.6
o Maui version: 3.2.6p18
o Server kernel version: 2.6.9-55.0.2.ELsmp
o Worker kernel version: 2.6.21
o 600-node AMD dual-core dual-socket cluster, 4GB memory per worker node,
fast Ethernet.
o 520-node Intel P4 single-core cluster, 1GB memory per worker node, fast
Ethernet.

Problem
========

One server serves both the 600-node and the 520-node clusters; each
cluster has its own queue. When a 100+ node job is submitted to either
queue, the job goes from Q -> R -> Q, and the following message appears on
the job's head mom node and on its other worker mom nodes:

pbs_mom;Svr;pbs_mom;Job with requested ID already exists (15009) in
im_request, KILL/ABORT request for job 202902 returned error
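
To gauge how many jobs are affected, the 15009 messages can be tallied per
job ID. The following is only a minimal sketch: it assumes each message
sits on a single log line in the form shown above, and the function name
count_15009 is made up for illustration.

```shell
# Hypothetical helper: count "(15009)" errors per job ID.
# Reads pbs_mom log lines on stdin; prints "<count> <jobid>" per job.
count_15009() {
    grep -F '(15009)' \
        | sed -n 's/.*request for job \([0-9][0-9]*\).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'
}
```

It would be fed a mom log on stdin, for example
"count_15009 < /var/spool/PBS/mom_logs/<logfile>", assuming the logs live
in the mom_logs directory under the configured server home.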

Running "qrun <jobid>" several times eventually starts the job. We
configured Torque 2.1.6 with the following:

./configure --prefix=/usr/local/pbs --enable-docs --enable-server
--enable-mom --disable-gui --with-server-home=/var/spool/PBS
--with-default-server=foobar --with-rcp=/usr/bin/rcp
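
The repeated-qrun workaround mentioned above can be scripted as a stopgap.
This is a rough sketch, not a fix: the function name qrun_retry and its
default attempt count and delay are assumptions, qrun needs operator or
manager privileges, and the delay should be long enough for a failed
launch to fall back to Q before the state is checked.

```shell
# Hypothetical helper: re-issue qrun until the job reports state R.
# Usage: qrun_retry <jobid> [max_attempts] [delay_seconds]
qrun_retry() {
    jobid=$1
    max=${2:-10}      # assumed default: 10 attempts
    delay=${3:-30}    # assumed default: 30 seconds between checks
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        qrun "$jobid"
        sleep "$delay"
        # qstat -f prints a line of the form "job_state = R"
        state=$(qstat -f "$jobid" | sed -n 's/^ *job_state = //p')
        if [ "$state" = "R" ]; then
            echo "job $jobid running after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "job $jobid still not running after $max attempts" >&2
    return 1
}
```

For example, "qrun_retry 202902 5 60" would try up to five times, checking
the job state one minute after each attempt.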

What started this
=================
Everything was working fine until we upgraded the worker kernels from
2.6.12.6 to 2.6.21.

Our diagnosis attempts
======================
We built Torque 2.1.9 with:

./configure --prefix=/usr/local/pbs --enable-docs --enable-server
--enable-mom --disable-gui --with-server-home=/var/spool/PBS
--with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp

We kept our old 2.1.6 server settings. With Torque 2.1.9 configured as
above, every job was assigned only 1 node, whether we requested 1 node or
300.

Any help in this regard would be much appreciated.


