[torqueusers] Torque MOMs terminating jobs immediately after starting

Prakash Velayutham prakash.velayutham at cchmc.org
Tue Apr 14 06:35:46 MDT 2009


Hello folks,

Wanted to come back and report that this turned out to be a Torque bug  
that CR fixed for me.

My MOM config file contains the following directive, which is what
exposed the bug:

$job_output_file_umask	userdefault

This is supposed to set the umask for the job's output files to the
user's default umask.

The code that determines the user's default umask had a "popen" call
without a matching "pclose" call, which was causing the jobs to
terminate immediately after being started on the MOM. I am not sure
why this happened only with interactive jobs (that is what we
observed, AFAIK).

Once I changed that directive to
$job_output_file_umask 022
the issue disappeared, even without the patch CR provided.
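
For anyone curious about the popen/pclose pairing mentioned above, here
is a minimal sketch in C. This is NOT the actual Torque code or CR's
patch; the helper name read_shell_umask is hypothetical, and it simply
asks the shell started by popen for its umask (which it inherits from
the calling process). The point is only that every successful popen is
followed by a pclose on every path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper for illustration only (not Torque's code).
 * Runs "umask" in a shell and parses the octal value it prints.
 * Returns 0 on success and stores the mask in *out, -1 on failure. */
int read_shell_umask(mode_t *out)
{
    char buf[32];
    char *end;
    unsigned long val;
    int got_line;
    FILE *fp;

    assert(out != NULL);

    fp = popen("umask", "r");   /* runs: sh -c umask */
    if (fp == NULL)
        return -1;

    got_line = (fgets(buf, sizeof buf, fp) != NULL);

    /* Matching pclose(): closes the stream and waits for the child,
     * so neither the descriptor nor the child process is left behind. */
    if (pclose(fp) == -1)
        return -1;
    if (!got_line)
        return -1;

    val = strtoul(buf, &end, 8);   /* shell prints octal, e.g. 0022 */
    if (end == buf || val > 0777)
        return -1;

    *out = (mode_t)val;
    return 0;
}
```

A caller would do something like "mode_t m; if (read_shell_umask(&m)
== 0) umask(m);" before creating the output files.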

Prakash

On Mar 2, 2009, at 12:07 PM, Prakash Velayutham wrote:

> Hello,
>
> I am running Torque 2.3.6 (in --ha mode if that makes any difference).
>
> Intermittently, I am seeing the following behaviour with the MOMs.
>
> A job would be accepted by the server and scheduled by the Moab  
> (5.3.1) scheduler, but then after the job is shipped to the compute  
> node, it gets terminated right away by the node with the following  
> in its log.
>
> 03/02/2009 12:00:26;0001;   pbs_mom;Job;TMomFinalizeJob3;job 5619.bmiclustersvcd1.cchmc.org started, pid = 22521
> 03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;Job Modified at request of PBS_Server at bmiclustersvcd1.cchmc.org
> 03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;kill_task: killing pid 22863 task 1 gracefully with sig 15
> 03/02/2009 12:00:26;0080;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;scan_for_terminated: job 5619.bmiclustersvcd1.cchmc.org task 1 terminated, sid=22521
> 03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;job was terminated
> 03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/02/2009 12:00:26;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/02/2009 12:00:26;0008;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;checking job post-processing routine
> 03/02/2009 12:00:26;0080;   pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;obit sent to server
>
> As I mentioned earlier, this is very random. It does this even with  
> interactive jobs, so it is not something to do with the batch  
> scripts. See below:
>
> velge9 at bmiclusterd1:~> qsub -I -lnodes=bmi-xeon3-04
> qsub: waiting for job 5620.bmiclustersvcd1.cchmc.org to start
> qsub: job 5620.bmiclustersvcd1.cchmc.org ready
>
>
> qsub: job 5620.bmiclustersvcd1.cchmc.org completed
> velge9 at bmiclusterd1:~>
>
> I am baffled. Any help appreciated.
>
> Thanks,
> Prakash


