[torqueusers] Problems upgrading from 2.4 to 2.5

J.A. Magallón jamagallon at ono.com
Tue Nov 30 17:40:45 MST 2010


On Mon, 29 Nov 2010 21:31:23 -0500, Glen Beane <glen.beane at gmail.com> wrote:

> On Mon, Nov 29, 2010 at 10:07 AM, J.A. Magallón <jamagallon at ono.com> wrote:
> > Hi all...
> >
> > First of all, hi to everyone, I'm new to the list.
> > I usually have solved my problems with torque with some googling, but this
> > is driving me nuts.
> >
> > I have been using torque 2.4 for some time, and everything works fine. But
> > now my distro has upgraded torque from 2.4.8 to 2.5.3, and I face a curious
> > problem.
> >
> > I have reduced the problem to a simple test, with just one node and
> > a simple, stupid queue:
> >
> > Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> > ---------------- ------ -------- -------- ----  --- --- --  -----
> > std                --      --       --      --    0   0 10   E R
> >
> > No limits, no nothing. Box is a quad core cpu.
> >
> > With a simple job:
> >
> > werewolf:~/dev/mpi/tst> cat k
> > #!/bin/bash
> > #PBS -N x
> > #PBS -S /bin/bash
> > #PBS -j oe
> >
> > echo "server:" $PBS_SERVER
> > echo "queue: " $PBS_QUEUE
> > echo "client:" $PBS_O_HOST
> > echo "cwd:   " $PBS_O_WORKDIR
> >
> > echo "nodefile<"$PBS_NODEFILE">:"
> > cat $PBS_NODEFILE
> >
> > sleep 30
> >
> > with torque 2.4, I could do this:
> >
> > werewolf:~/dev/mpi/tst> qsub -l nodes=1:ppn=2 k
> > 0.werewolf.home
> >
> > (what I really do is run MPI with mpirun -pernode...)
> >
> > But with torque 2.5, this does not work anymore:
> >
> > werewolf:~/dev/mpi/tst> qsub -l nodes=1:ppn=2 k
> > qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
> >
> > Uh? What has changed? It looks like 2.5 ignores that the box has 4 cores...
> >
> > Any idea? Has some behavior changed, is it a bug, or should it work
> > and perhaps it's a packaging/compiler issue?
> 
> what does the pbs_server nodes file look like?  What do you see when
> you run "pbsnodes -a"?

werewolf:/var/spool/pbs# cat server_name
localhost
werewolf:/var/spool/pbs# cat server_priv/nodes
localhost np=4

werewolf:~/dev/mpi/tst> qnodes -a
localhost
     state = free
     np = 4
     ntype = cluster
     status = rectime=1291163278,varattr=,jobs=,state=free,netload=5762336355,gres=,loadave=0.14,ncpus=4,physmem=8195092kb,availmem=9523728kb,totmem=10211244kb,idletime=1577,nusers=3,nsessions=14,sessions=7347 3478 3975 935 7240 7267 8412 8422 8435 8447 8506 8532 22142 24881,uname=Linux werewolf.home 2.6.36.1-desktop-1mnb #1 SMP Tue Nov 23 01:22:32 CET 2010 x86_64,opsys=linux

It is something related to ppn:

werewolf:~/dev/mpi/tst> qsub -l nodes=1 k
3.werewolf.home
werewolf:~/dev/mpi/tst> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3.werewolf                 x                magallon               0 R std

werewolf:~/dev/mpi/tst> qsub -l nodes=1:ppn=2 k
qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes

So, nodes=1 works, nodes=1:ppn=1 works, but nodes=1:ppn=2 fails... and torque
knows localhost has 4 processors.
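
To rule out a plain mismatch between the request and the nodes file, the comparison can be sketched as a small shell function. This is only a sketch under assumptions: the name feasible_nodes is hypothetical, it expects nodes-file lines of the form "hostname np=N", it treats a missing np= as 1, and it knows nothing about how pbs_server itself decides feasibility.

```shell
#!/bin/bash
# Hypothetical helper (not part of torque): list the hosts in a nodes file
# whose np= count is at least the requested ppn.
# Assumption: lines look like "hostname np=N [properties...]".
feasible_nodes() {
    local nodes_file=$1 ppn=$2
    awk -v ppn="$ppn" '
        /^[ \t]*#/ || NF == 0 { next }   # skip comments and blank lines
        {
            np = 1                       # assumed default when np= is absent
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
            if (np >= ppn) print $1
        }
    ' "$nodes_file"
}
```

Called as "feasible_nodes /var/spool/pbs/server_priv/nodes 2" against the nodes file shown above, it lists localhost, so the nodes file by itself does not explain the rejection.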

werewolf:~/dev/mpi/tst> qmgr -c 'p q std'
#
# Create queues and set their attributes.
#
#
# Create and define queue std
#
create queue std
set queue std queue_type = Execution
set queue std max_running = 10
set queue std enabled = True
set queue std started = True

Curious...

-- 
J.A. Magallon <jamagallon()ono!com>     \               Software is like sex:
                                         \         It's better when it's free