[torqueusers] Torque MOMs terminating jobs immediately after starting

Prakash Velayutham prakash.velayutham at cchmc.org
Mon Mar 2 10:07:15 MST 2009


Hello,

I am running Torque 2.3.6 (in --ha mode if that makes any difference).

Intermittently, I am seeing the following behaviour with the MOMs.

A job is accepted by the server and scheduled by the Moab (5.3.1)
scheduler, but as soon as it is shipped to the compute node, the node
terminates it and writes the following to its log:

03/02/2009 12:00:26;0001;   pbs_mom;Job;TMomFinalizeJob3;job 5619.bmiclustersvcd1.cchmc.org started, pid = 22521
03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;Job Modified at request of PBS_Server@bmiclustersvcd1.cchmc.org
03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;kill_task: killing pid 22863 task 1 gracefully with sig 15
03/02/2009 12:00:26;0080;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;scan_for_terminated: job 5619.bmiclustersvcd1.cchmc.org task 1 terminated, sid=22521
03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;job was terminated
03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;checking job post-processing routine
03/02/2009 12:00:26;0080;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;obit sent to server
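
In case it helps anyone comparing entries like these across nodes, here
is a minimal Python sketch that splits MOM log lines into fields and
pulls out the entries for one job. It assumes each entry sits on a
single line (no mail-client wrapping) and has six semicolon-separated
fields. The field names and the helpers parse_mom_log and events_for_job
are my own labels for illustration, not anything Torque provides.

```python
from typing import List, NamedTuple


class MomLogEntry(NamedTuple):
    # Field names are illustrative assumptions, not official Torque terms.
    timestamp: str   # e.g. "03/02/2009 12:00:26"
    event_code: str  # e.g. "0008"
    daemon: str      # e.g. "pbs_mom"
    obj_class: str   # e.g. "Job" or "Svr"
    obj_name: str    # job id or routine name
    message: str     # free-form text


def parse_mom_log(text: str) -> List[MomLogEntry]:
    """Split unwrapped log lines into six semicolon-separated fields."""
    entries = []
    for line in text.splitlines():
        # maxsplit=5 keeps any semicolons inside the message intact.
        parts = [p.strip() for p in line.split(";", 5)]
        if len(parts) == 6:
            entries.append(MomLogEntry(*parts))
    return entries


def events_for_job(entries: List[MomLogEntry], job_id: str) -> List[MomLogEntry]:
    """Keep entries that mention the job id in the name or message field."""
    return [e for e in entries if job_id in e.obj_name or job_id in e.message]


# Three entries copied from the excerpt above, one per line.
SAMPLE = """\
03/02/2009 12:00:26;0001;   pbs_mom;Job;TMomFinalizeJob3;job 5619.bmiclustersvcd1.cchmc.org started, pid = 22521
03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;kill_task: killing pid 22863 task 1 gracefully with sig 15
03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
"""

if __name__ == "__main__":
    job_id = "5619.bmiclustersvcd1.cchmc.org"
    for e in events_for_job(parse_mom_log(SAMPLE), job_id):
        print(e.timestamp, e.event_code, e.message)
```

Filtering this way makes it easier to see that the start and the
kill_task entries for the job carry the same timestamp.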

As I mentioned above, this happens intermittently. It also happens with
interactive jobs, so the batch scripts are not the cause.
See below:

velge9@bmiclusterd1:~> qsub -I -lnodes=bmi-xeon3-04
qsub: waiting for job 5620.bmiclustersvcd1.cchmc.org to start
qsub: job 5620.bmiclustersvcd1.cchmc.org ready


qsub: job 5620.bmiclustersvcd1.cchmc.org completed
velge9@bmiclusterd1:~>

I am baffled. Any help appreciated.

Thanks,
Prakash

