[torqueusers] prologue not executing for a subset of jobs

Liam Forbes lforbes at arsc.edu
Thu Feb 13 13:27:49 MST 2014


Hello!  We’re running TORQUE 4.2.5 from Penguin Computing and Moab 7.2.1.

This week, we noticed that a number of jobs did not perform the steps in our prologue script.  Looking into it, I found that for these jobs, the prologue had not logged its start and end time in syslog on the mother superior node.  I believe the prologue never ran for these jobs, yet the jobs themselves still executed.  The prologue ran fine for other jobs during the same time period.  Also, the prologue.parallel script executed on the sister nodes for all jobs.

Looking at the torque logs, the jobs that appear not to have executed their prologue do not have job started records in the mother superior's mom log.  They also logged a “Post job file processing error” message in the server log.  I believe that in all cases I had to clean up the jobs with `momctl -h -c`.  In the mom logs, the first log messages from the mother superior take one of two forms, either

02/11/2014 12:09:20;0001;   pbs_mom.200106;Job;exec_bail;bailing on job 756419.scyld code -3
02/11/2014 12:09:20;0008;   pbs_mom.200106;Req;send_sisters;sending ABORT to sisters for job 756419.scyld
02/11/2014 12:09:21;0001;   pbs_mom.200106;Job;im_request;10.54.50.83:223 sent an abort. Killing job 756419.scyld
02/11/2014 12:09:22;0080;   pbs_mom.200106;Job;756419.scyld;obit sent to server

or

02/11/2014 13:15:47;0001;   pbs_mom.33954;Job;examine_all_running_jobs;job 756438.scyld already examined. substate=40
02/11/2014 13:15:47;0001;   pbs_mom.33954;Job;exec_bail;bailing on job 756438.scyld code -4
02/11/2014 13:15:47;0008;   pbs_mom.33954;Req;send_sisters;sending ABORT to sisters for job 756438.scyld
02/11/2014 13:15:47;0080;   pbs_mom.33954;Job;756438.scyld;obit sent to server
02/11/2014 13:15:48;0080;   pbs_mom.33954;Req;req_reject;Reject reply code=15013(PBS server internal error REJHOST=n40 MSG=cannot open '/var/spool/torque/mom_priv/jobs/756438.scyld.SC' errno=17 - File exists), aux=0, type=JobScript, from PBS_Server at scyld

What do these mom log messages mean?

How can I confirm via torque (the mom daemon?) whether or not the prologue executes on the mother superior?  I assume I can’t do anything about the jobs that have already gone away; I can only try to catch the problem occurring in the future.
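
To catch future occurrences, something like the following could scan a mother superior's mom log for exec_bail records.  This is only a minimal sketch, not a TORQUE tool: it assumes the semicolon-separated layout seen in the excerpts above, and the names Bail and find_exec_bails are hypothetical.  It reports which jobs bailed and with what code; it does not by itself show whether the prologue ran.

```python
import re
from typing import Iterable, List, NamedTuple


class Bail(NamedTuple):
    """One exec_bail record pulled from a mom log (hypothetical helper type)."""
    timestamp: str
    job_id: str
    code: int


# Message text as it appears in the log excerpts: "bailing on job <id> code <n>"
_BAIL_RE = re.compile(r"bailing on job (\S+) code (-?\d+)")


def find_exec_bails(lines: Iterable[str]) -> List[Bail]:
    """Return every exec_bail record found in the given mom log lines.

    Assumes the layout seen in the excerpts:
    timestamp;event;daemon;class;function-or-jobid;message
    """
    bails = []
    for line in lines:
        # Split at most 5 times so semicolons inside the message survive.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[4] != "exec_bail":
            continue
        match = _BAIL_RE.search(fields[5])
        if match:
            bails.append(Bail(fields[0], match.group(1), int(match.group(2))))
    return bails


# Two lines copied from the log excerpts, plus one non-matching line.
SAMPLE = [
    "02/11/2014 12:09:20;0001;   pbs_mom.200106;Job;exec_bail;bailing on job 756419.scyld code -3",
    "02/11/2014 13:15:47;0080;   pbs_mom.33954;Job;756438.scyld;obit sent to server",
    "02/11/2014 13:15:47;0001;   pbs_mom.33954;Job;exec_bail;bailing on job 756438.scyld code -4",
]

if __name__ == "__main__":
    for bail in find_exec_bails(SAMPLE):
        print(bail.timestamp, bail.job_id, bail.code)
```

Run against a real log, the same function could be fed an open file handle in place of SAMPLE, and the resulting job ids compared against the prologue start/end entries in syslog.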

The full server and mom logs for each job are attached.

Any assistance is appreciated.

Regards,
-liam

-There are uncountably more irrational fears than rational ones. -P. Dolan
Liam Forbes             Senior HPC Systems Analyst,           LPIC1, CISSP
ARSC, U of AK, Fairbanks   lforbes at arsc.edu 907-450-8618 fax: 907-450-8605

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pbs server logs.txt
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20140213/ac4d04b7/attachment-0002.txt 
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: pbs mom logs.txt
Url: http://www.supercluster.org/pipermail/torqueusers/attachments/20140213/ac4d04b7/attachment-0003.txt 

