[torqueusers] mom_job_sync not working???

Eva Hocks hocks at sdsc.edu
Fri Aug 2 14:02:32 MDT 2013



I am running Torque 3.0.5 and have set mom_job_sync = True (which has
been the default since 2.0 anyway).

Nevertheless, jobs in state E do not get removed after the node has been
reinstalled and obviously does not know about the job any more.


From the server logs:

07/31/2013 11:24:10;0008;PBS_Server;Job;236751.tscc-mgr.local;Job deleted at request of gpratt at tscc-login2.local
07/31/2013 11:24:10;0008;PBS_Server;Job;236751.tscc-mgr.local;Job sent signal SIGTERM on delete
07/31/2013 11:24:18;0008;PBS_Server;Job;236751.tscc-mgr.local;Job sent signal SIGKILL on delete
07/31/2013 11:25:18;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::job nanny, exiting job '236751.tscc-mgr.local' still exists, sending a SIGKILL
07/31/2013 11:29:24;0010;PBS_Server;Job;236751.tscc-mgr.local;Exit_status=271 resources_used.cput=00:01:02 resources_used.mem=78792kb resources_used.vmem=151808kb resources_used.walltime=00:11:01
07/31/2013 12:36:35;0080;PBS_Server;Job;236751.tscc-mgr.local;Request invalid for state of job EXITING
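
A minimal sketch for finding such jobs in a server log, assuming each record has the six semicolon-separated fields seen in the lines above. The field names and function names here are my own labels for illustration, not anything defined by Torque.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are assumptions based on the log lines quoted above.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # Split on the first five semicolons only, so the message stays intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # e.g. a wrapped continuation line
    return LogRecord(*parts)


def jobs_rejected_in_exiting(lines: Iterable[str]) -> List[str]:
    """Return sorted job IDs with a 'state of job EXITING' rejection."""
    stuck = set()
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec.obj_class != "Job":
            continue
        if "Request invalid for state of job EXITING" in rec.message:
            stuck.add(rec.obj_name)
    return sorted(stuck)


if __name__ == "__main__":
    sample = [
        "07/31/2013 11:24:18;0008;PBS_Server;Job;236751.tscc-mgr.local;"
        "Job sent signal SIGKILL on delete",
        "07/31/2013 12:36:35;0080;PBS_Server;Job;236751.tscc-mgr.local;"
        "Request invalid for state of job EXITING",
    ]
    for job_id in jobs_rejected_in_exiting(sample):
        print(job_id)
```

This only reports jobs for which the server rejected a request because the job was EXITING; it does not change any state on the server or the MOM.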


And the job never gets cleaned out. From the manual: "If a job exists on
a compute node in a pre-execution or corrupt state, it will be
automatically cleaned up and purged."  mom_job_sync should send a purge
request instead of a qdel.

 Am I missing something?

Thanks
Eva


