[torqueusers] Job arrays problems
R. David
david at unistra.fr
Tue Mar 29 03:56:24 MDT 2011
Hello,
We use Torque 2.5.4 and Maui 3.2.6p21 on CentOS Linux 5.3.
For several weeks, we have been having strange problems with job arrays (the -t syntax).
When users submit arrays, everything seems to work fine at first. After a few days or weeks, things start to get weird. From the batch system's point of view, the nodes are occupied by array job instances (say 1234[57]). From the operating system's point of view, the nodes are empty: no processes belonging to those jobs are running there.
As a result, the batch system cannot run jobs on these nodes. For instance, Maui complains with:
Messages: cannot start job - RM failure, rc: 15046, msg: 'Resource temporarily unavailable REJHOST=XXX MSG=cannot allocate node 'XXX' to job - node not currently available (nps needed/free: 12/0, gpus needed/free: 0/0, joblist: 50402[7]:0,50402[7]:1,50402[7]:2,50402[7].:3,50402[7]:4,504'
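To see which array instances the resource manager believes are holding the slots, the joblist in a message like the one above can be tallied. The following is only a rough Python sketch: the function name slots_per_job and the regular expression are assumptions made for illustration, based solely on the entry format visible in this message ("<jobid>[<index>]:<slot>"), not on any documented Torque or Maui format. The truncated final entry is left as it is and simply ignored.

```python
import re
from collections import Counter

# Message text copied from the Maui output above (truncated in the original).
MESSAGE = (
    "cannot allocate node 'XXX' to job - node not currently available "
    "(nps needed/free: 12/0, gpus needed/free: 0/0, "
    "joblist: 50402[7]:0,50402[7]:1,50402[7]:2,50402[7].:3,50402[7]:4,504'"
)

# Assumed entry shape: "<jobid>[<index>]:<slot>". One entry in the message
# carries a stray "." before the colon, so the pattern tolerates it.
ENTRY = re.compile(r"(\d+\[\d+\])\.?:(\d+)")


def slots_per_job(message):
    """Count how many slots each array instance holds according to the joblist."""
    _, _, joblist = message.partition("joblist: ")
    # Incomplete entries (such as the cut-off tail) do not match and are skipped.
    return Counter(job for job, _slot in ENTRY.findall(joblist))


if __name__ == "__main__":
    for job, count in slots_per_job(MESSAGE).items():
        print(f"{job}: {count} slot(s)")
```

The resulting job identifiers can then be compared with the processes actually present on the node to confirm that the instances are stale.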
As far as I can tell, nothing relevant appears in the pbs_mom or pbs_server log files.
Have you been able to run job arrays successfully at your site? Have you seen similar problems?
Regards,
---------------------------------------------------------
R. David - david at unistra.fr
Head of the meso-centre
UdS / Direction Informatique
Tel. : 03 68 85 45 48
---------------------------------------------------------