[torqueusers] Torque MOMs terminating jobs immediately after starting
Prakash Velayutham
prakash.velayutham at cchmc.org
Tue Apr 14 06:35:46 MDT 2009
Hello folks,
I wanted to come back and report that this turned out to be a Torque
bug, which CR fixed for me.
My MOM config file has the following directive, which is what exposed
the bug:
$job_output_file_umask userdefault
This is supposed to set the umask of the job's output files to the
user's default umask.
The code that determines the user's default umask had a "popen" call
without a matching "pclose", which was causing jobs to terminate
immediately after being started on the MOM. I am not sure why this
happened only with interactive jobs (that is what we observed, AFAIK).
Once the directive was changed to
$job_output_file_umask 022
the issue disappeared, even without the patch CR provided.
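For anyone curious about the popen/pclose pairing mentioned above, here is
a minimal sketch in C. It is not Torque's actual code: the helper name
read_default_umask and the choice to run the shell's "umask" builtin are
assumptions made purely for illustration. The point is only that every
successful popen is followed by a pclose, which closes the pipe and waits
for the child process. How exactly the missing pclose led to the job being
killed inside the MOM is not something this sketch tries to explain.

```c
#define _POSIX_C_SOURCE 200809L  /* expose popen/pclose under strict C modes */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (NOT taken from the Torque source): run the shell's
 * `umask` builtin through popen and parse its octal output.
 * Returns 0 and stores the mask in *mask_out on success, -1 on failure.
 */
int read_default_umask(mode_t *mask_out)
{
    char buf[64];
    char *end;
    unsigned long val;
    int parsed = 0;
    int status;

    FILE *fp = popen("umask", "r");  /* popen runs the command via /bin/sh -c */
    if (fp == NULL)
        return -1;

    if (fgets(buf, sizeof buf, fp) != NULL) {
        val = strtoul(buf, &end, 8);         /* umask prints octal, e.g. 0022 */
        if (end != buf && val <= 0777) {
            *mask_out = (mode_t)val;
            parsed = 1;
        }
    }

    /* The matching pclose: closes the stream and reaps the child.
     * Skipping it leaks a descriptor and leaves an unreaped child. */
    status = pclose(fp);
    if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return parsed ? 0 : -1;
}
```

A caller would declare a mode_t, pass its address to read_default_umask,
and check for a return value of 0 before using the mask.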
Prakash
On Mar 2, 2009, at 12:07 PM, Prakash Velayutham wrote:
> Hello,
>
> I am running Torque 2.3.6 (in --ha mode if that makes any difference).
>
> Only randomly I am seeing the following behaviour with the MOMs.
>
> A job would be accepted by the server and scheduled by the Moab
> (5.3.1) scheduler, but then after the job is shipped to the compute
> node, it gets terminated right away by the node with the following
> in its log.
>
> 03/02/2009 12:00:26;0001; pbs_mom;Job;TMomFinalizeJob3;job 5619.bmiclustersvcd1.cchmc.org started, pid = 22521
> 03/02/2009 12:00:26;0008; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;Job Modified at request of PBS_Server@bmiclustersvcd1.cchmc.org
> 03/02/2009 12:00:26;0008; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;kill_task: killing pid 22863 task 1 gracefully with sig 15
> 03/02/2009 12:00:26;0080; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;scan_for_terminated: job 5619.bmiclustersvcd1.cchmc.org task 1 terminated, sid=22521
> 03/02/2009 12:00:26;0008; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;job was terminated
> 03/02/2009 12:00:26;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 03/02/2009 12:00:26;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 03/02/2009 12:00:26;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 03/02/2009 12:00:26;0008; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;checking job post-processing routine
> 03/02/2009 12:00:26;0080; pbs_mom;Job;5619.bmiclustersvcd1.cchmc.org;obit sent to server
>
> As I mentioned earlier, this is very random. It does this even with
> interactive jobs, so it is not something to do with the batch
> scripts. See below:
>
> velge9@bmiclusterd1:~> qsub -I -lnodes=bmi-xeon3-04
> qsub: waiting for job 5620.bmiclustersvcd1.cchmc.org to start
> qsub: job 5620.bmiclustersvcd1.cchmc.org ready
>
>
> qsub: job 5620.bmiclustersvcd1.cchmc.org completed
> velge9@bmiclusterd1:~>
>
> I am baffled. Any help appreciated.
>
> Thanks,
> Prakash