[torqueusers] MOM rejected/rc=1

Eva Hocks hocks at sdsc.edu
Wed Jun 12 12:04:15 MDT 2013




We are running Torque version 3.0.5 with cpuset enabled. It seems that
when a job is terminated, the mom reports the node as free to the server
while it is still cleaning up the terminated job:


06/12/2013 10:24:29;0008;   pbs_mom;Job;7762.tscc-mgr.local;job was terminated
06/12/2013 10:24:48;0008;   pbs_mom;Job;7762.tscc-mgr.local;ERROR:    received request 'ALL_OKAY' from 10.1.255.165:15003 for job '7762.tscc-mgr.local' (job does not exist locally)
06/12/2013 10:25:20;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.5, loglevel = 0

The next job then starts on this node:
06/12/2013 10:25:27;0001;   pbs_mom;Job;TMomFinalizeJob3;job 7764.tscc-mgr.local started, pid = 13955
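
To put a number on the window between termination and the next start, the
mom log lines can be parsed for their timestamps. This is only a rough
sketch: it assumes the semicolon-separated layout shown in the excerpt
above, and the function names are made up for illustration and are not
part of Torque.

```python
from datetime import datetime


def parse_mom_line(line):
    """Split one pbs_mom log line into (timestamp, message).

    Assumes the layout seen in the excerpt:
    date time;code;daemon;class;id;message
    """
    fields = line.split(";", 5)
    stamp = datetime.strptime(fields[0].strip(), "%m/%d/%Y %H:%M:%S")
    return stamp, fields[5]


def seconds_from_termination_to_next_start(lines):
    """Return seconds between the first 'job was terminated' entry and the
    next 'started, pid' entry, or None if no such pair is found."""
    terminated_at = None
    for line in lines:
        if line.count(";") < 5:
            continue  # not a mom log line in the assumed layout
        stamp, message = parse_mom_line(line)
        if "job was terminated" in message:
            if terminated_at is None:
                terminated_at = stamp
        elif "started, pid" in message and terminated_at is not None:
            return (stamp - terminated_at).total_seconds()
    return None


mom_log = [
    "06/12/2013 10:24:29;0008;   pbs_mom;Job;7762.tscc-mgr.local;job was terminated",
    "06/12/2013 10:25:20;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 3.0.5, loglevel = 0",
    "06/12/2013 10:25:27;0001;   pbs_mom;Job;TMomFinalizeJob3;job 7764.tscc-mgr.local started, pid = 13955",
]
print(seconds_from_termination_to_next_start(mom_log))
```

For the lines quoted above this gives the gap between 10:24:29 and
10:25:27.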



Job 7763 was started while tscc-1-16 was still cleaning up the unused
cpuset from the previous job.

06/12/2013 10:24:41  S    Job Run at request of maui at tscc-mgr.local
06/12/2013 10:24:47  S    send of job to tscc-1-16 failed error = 15091
  Timed out wating for a reply (15091) in send_job, child failed in previous commit request

This happens about 62 times in 10 hours on our test system.
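
A count like this can be reproduced by scanning the job trace output for
the send failure message. The following is a minimal sketch, assuming the
message wording shown in the trace above; the function name and the idea
of grouping by node are illustrative only.

```python
import re
from collections import Counter

# Matches the failure message as it appears in the trace above.
SEND_FAILURE = re.compile(r"send of job to (\S+) failed error = (\d+)")


def count_send_failures(lines, error_code="15091"):
    """Count send failures with the given error code, grouped by node."""
    per_node = Counter()
    for line in lines:
        match = SEND_FAILURE.search(line)
        if match and match.group(2) == error_code:
            per_node[match.group(1)] += 1
    return per_node


trace = [
    "06/12/2013 10:24:41  S    Job Run at request of maui at tscc-mgr.local",
    "06/12/2013 10:24:47  S    send of job to tscc-1-16 failed error = 15091",
]
failures = count_send_failures(trace)
print(sum(failures.values()), dict(failures))
```

Grouping by node makes it easier to see whether the failures are spread
across the cluster or concentrated on a few nodes.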

Is there a flag in the Torque mom or server configuration to delay
reporting the mom's "free" state to the server? This was never a problem
without cpuset enabled.


Thanks for any help,
Eva


