[torqueusers] Job submission: requesting vmem/mem resource

Michael Lackner michael.lackner at unileoben.ac.at
Thu Jun 28 01:11:41 MDT 2012


Greetings!

I guess this may have been discussed before, but I just couldn't find an
answer to my problem anywhere in the documentation or on the web, so maybe
somebody here knows or can point me in the right direction.

I am trying to set up TORQUE with several different kinds of nodes, and so
far everything is running quite fine. The nodes are:

* 4 machines with CentOS 5.8 32-bit, 3GB available RAM
* 1 machine with CentOS 5.8 64-bit, 16GB available RAM
* 2 machines with CentOS 5.8 64-bit, 72GB available RAM

Software:

* torque-2.1.9-1.el5.kb
* maui-3.3-4.el5

Now I have defined an arch resource, so users can explicitly choose to
run their job on a 32-bit or 64-bit machine by requesting an arch via
qsub/xpbs. That works perfectly fine.
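For reference, such a request looks roughly like this (the arch values
x86_64/i686 are just what I assume the moms report, and job.sh is a
placeholder script name):

```shell
# Send the job to a 64-bit node ("job.sh" is a placeholder script name)
qsub -l arch=x86_64 job.sh

# Send the job to one of the 32-bit nodes
qsub -l arch=i686 job.sh
```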

But what I would also like to do is let users request mem or vmem as a
resource, so that smaller 64-bit jobs go to the 16GB machine and bigger
ones to the large 72GB machines. I always want TORQUE to pick the weakest
or smallest machine that can still handle a given job, and only send a job
to a larger machine when really necessary.

So in my job script I would have something like:

#PBS -l mem=6gb

or:

#PBS -l vmem=6gb
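Put together, a minimal test script looks roughly like this (the payload
line is just a placeholder, not my actual workload):

```shell
#!/bin/bash
# Request 6GB of physical memory for this job
#PBS -l mem=6gb
# Placeholder payload, just to check whether the job runs at all
echo "running on $(hostname)"
```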

Now the weird thing is that this works as long as the requested memory is
below the 3GB available on the smallest nodes. If I request mem=2gb, it
works fine. But as soon as I request, say, mem=6gb, the job always fails
and I get nothing on stdout/stderr.

I've also tried setting a default vmem of 72GB on the queue and then
requesting vmem=6gb, but the job still got terminated. I also read in the
documentation that on Linux one needs to request at least nodes=1 so
that mem/vmem can be requested successfully, so I tried that too, to no
avail:

#PBS -l mem=6gb,nodes=1

or:

#PBS -l vmem=6gb,nodes=1
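For completeness, the queue default mentioned above was set roughly like
this ("batch" stands in for our actual queue name):

```shell
# Set a default vmem of 72GB on the queue via qmgr
# ("batch" is a placeholder for the real queue name)
qmgr -c "set queue batch resources_default.vmem = 72gb"
```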

I checked the mom_logs, but they don't tell me anything helpful (I
replaced our hostname in the log with "<serverhost>"). This is from the
node that was supposed to execute the job:

==
06/28/2012 08:52:10;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec
failure, after files staged, no retry
06/28/2012 08:52:10;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
06/28/2012 08:52:10;0008;   pbs_mom;Job;154.<serverhost>;Job Modified at request of
PBS_Server@<serverhost>
==

The server_logs don't say anything. The Maui log only talks about starting
the job, not about terminating it; there are no errors or warnings to be seen.

Could somebody advise me on how to track down the error I'm making here?

Thanks a lot!

-- 
Michael Lackner
Lehrstuhl für Informationstechnologie (CiT)
Montanuniversität Leoben
Tel.: +43 (0)3842/402-1505 | Mail: michael.lackner at unileoben.ac.at
Fax.: +43 (0)3842/402-1502 | Web: http://institute.unileoben.ac.at/infotech
