[torqueusers] Job arrays problems

R. David david at unistra.fr
Tue Mar 29 03:56:24 MDT 2011


We use Torque 2.5.4 and Maui 3.2.6p21 on CentOS Linux 5.3.

For several weeks, we have been having strange problems with job arrays (-t syntax).

When users submit arrays, everything seems to work fine at first. After a few days or weeks, things start to get weird. From the batch system's point of view, the nodes are occupied by array job instances (say 1234[57]). From the operating system's point of view, the nodes are empty.

As a result, the batch system is unable to run jobs on these nodes. For instance, Maui complains with:

Messages:  cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=XXX MSG=cannot allocate node 'XXX' to job - node not currently available (nps needed/free: 12/0, gpus needed/free: 0/0, joblist: 50402[7]:0,50402[7]:1,50402[7]:2,50402[7].:3,50402[7]:4,504'
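
To make the message easier to read, here is a rough Python sketch that picks it apart. It assumes the field layout seen in this single message ("nps needed/free: N/M" followed by "joblist: id:slot,..."); the function name parse_rejection is only illustrative and is not part of Torque or Maui. The message is truncated in the original, so the incomplete last entry is skipped.

```python
import re
from collections import Counter

# Message copied from the post; it is truncated at the end in the original.
MESSAGE = (
    "cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily "
    "unavailable REJHOST=XXX MSG=cannot allocate node 'XXX' to job - node not "
    "currently available (nps needed/free: 12/0, gpus needed/free: 0/0, "
    "joblist: 50402[7]:0,50402[7]:1,50402[7]:2,50402[7].:3,50402[7]:4,504'"
)


def parse_rejection(message):
    """Return (nps_needed, nps_free, Counter of slots held per job id).

    Hypothetical helper: the layout is inferred from one example message only.
    """
    nps = re.search(r"nps needed/free: (\d+)/(\d+)", message)
    needed, free = (int(nps.group(1)), int(nps.group(2))) if nps else (None, None)

    slots = Counter()
    joblist = re.search(r"joblist: (.*)", message)
    if joblist:
        for entry in joblist.group(1).split(","):
            # Accept "50402[7]:0" and the odd "50402[7].:3" form seen above;
            # entries without a slot number (the truncated tail) are ignored.
            m = re.match(r"(\d+(?:\[\d+\])?)\.?:(\d+)$", entry.strip())
            if m:
                slots[m.group(1)] += 1
    return needed, free, slots


if __name__ == "__main__":
    needed, free, slots = parse_rejection(MESSAGE)
    print("processors needed/free:", needed, free)
    for job_id, count in slots.items():
        print(job_id, "holds", count, "slot(s) in the visible part of the list")
```

Read this way, the message says that 12 processors are needed and none are free on the node, and that every complete entry in the visible part of the job list belongs to the same array instance, 50402[7].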

As far as I can tell, nothing relevant appears in the pbs_mom or pbs_server log files.

Are you able to run job arrays successfully at your site? Have you had similar problems?


  R. David - david at unistra.fr
  Head of the meso-centre
  UdS / Direction Informatique
  Tel. : 03 68 85 45 48 
