[torqueusers] Post job file processing error? Jobs not running. :(

Garrick Staples garrick at usc.edu
Tue Mar 14 14:14:28 MST 2006


On Tue, Mar 14, 2006 at 02:34:25PM -0500, Aquarijen alleged:
> What would these stale mom processes look like?  If I start the
> pbs_mom, and then shut it down, it leaves the mom.lock out there and
> if I cat the mom.lock, there is no process with that number running on
> the machine.  There are also no processes with "mom" or "pbs" in the
> name.  I shut them all down and left them down for a few hours.  Upon
> restart, I have the same problems (was thinking maybe something would
> time out?).  Is the pbs_mom supposed to leave the lock out there after
> being shut down?

They would show up as "pbs_mom" in the process list, and since you
looked and found none, that isn't the problem.  Yes, pbs_mom leaves the
lockfile behind after it exits, but that doesn't matter.

You didn't mention it, but I'm assuming you are on 2.0.0p8.


> > > 03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
> > > 03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
> > > job file processing error; job 326.b08l02.oic.ornl.gov on host
> > > b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0

This means exactly what it says: MOM reported that the job failed to
start after any files were staged in.


> > > 03/13/2006 19:46:06;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;dequeuing
> > > from workq, state COMPLETE
> > > 03/13/2006 19:46:06;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> > > sent command term
> > >
> > > The log on the mom says:
> > >
> > > 03/13/2006 19:25:08;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > 03/13/2006 19:26:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > > started, Failure job exec failure, after files staged, no retry

This means the MOM child process that was to become the job hit an
error.  If it happened early in the process, the error goes to syslog.
If it happened later, it goes to the job's stderr.


> > > 03/13/2006 19:26:35;0008;   pbs_mom;Job;323.b08l02.oic.ornl.gov;Job
> > > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> > > 03/13/2006 19:40:14;0002;   pbs_mom;Svr;Log;Log opened
> > > 03/13/2006 19:40:14;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily
> > > unavailable (11) in pbs_mom, cannot lock
> > > '/var/spool/pbs/mom_priv/mom.lock' - another mom running

This worries me, and is why I originally thought stale MOM processes
were somehow hanging around.  Did you just mistakenly start a new MOM
without killing the old one?
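
One quick way to rule out a stale MOM is to compare the PID recorded in
mom.lock against the process table.  A minimal sketch in Python, assuming
the lock file holds the PID as plain text (as your "cat" suggests).  The
function names are mine and are not part of TORQUE:

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists."""
    if pid <= 0:
        # PID 0 or a negative value would address a process group.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def lock_holder_running(lock_path):
    """Read a PID from lock_path and report whether it is still alive.

    Assumes the file contains just the PID as text (hypothetical layout).
    """
    with open(lock_path) as fh:
        text = fh.read().strip()
    if not text.isdigit():
        return False
    return pid_is_running(int(text))
```

Calling lock_holder_running('/var/spool/pbs/mom_priv/mom.lock') on the
node would return False when the recorded MOM is gone, which matches what
you saw by hand.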


> > > I have stopped the pbs_moms and removed the mom.locks but I see the
> > > same behavior and I then see this in the logs on the mom:

Don't worry about removing the lock files.  If MOM has exited, the new
MOM will get a new lock.
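
The reason the leftover file is harmless, as far as I understand it, is
that what counts is a kernel advisory lock on the file, not the file's
existence.  The sketch below illustrates that behaviour with flock().
Whether pbs_mom uses flock() or fcntl() locking internally is an
assumption I haven't verified, and the temporary path is only a stand-in
for mom.lock:

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Open path and attempt a non-blocking exclusive lock.

    Returns the open file descriptor on success, or None when another
    holder already has the lock (EAGAIN or EACCES).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return None
        raise
    return fd


def demo():
    path = os.path.join(tempfile.mkdtemp(), "mom.lock")  # stand-in path
    first = try_lock(path)           # "old MOM" takes the lock
    second = try_lock(path)          # "new MOM" while the old one is alive
    os.close(first)                  # old MOM exits, the kernel drops the lock
    leftover = os.path.exists(path)  # the file itself is still on disk
    third = try_lock(path)           # a new MOM can now lock it again
    relocked = third is not None
    if third is not None:
        os.close(third)
    return second is None, leftover, relocked
```

The middle attempt fails the same way your 19:40:14 log entry did, and
the last one succeeds even though the file was never removed.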


> > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;caught signal 15:
> > > leaving jobs running, just exiting
> > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;Is down
> > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;Log;Log closed
> > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;Log;Log opened
> > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;restricted;b08l02
> > > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;initialize;independent
> > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;pbs_mom;Is up

No lockfile error this time, and the previous shutdown is recorded.
This looks normal.


> > > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > 03/13/2006 20:01:35;0002;   pbs_mom;Svr;im_eof;End of File from addr
> > > 172.16.3.253:15001
> > > 03/13/2006 20:01:35;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > 03/13/2006 20:01:42;0001;   pbs_mom;Svr;is_request;duplicate
> > > connection from 172.16.3.253:1023 - closing original connection

Don't worry about this stuff.  That's just the new MOM getting synced
with pbs_server.


> > > 03/13/2006 20:02:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > > started, Failure job exec failure, after files staged, no retry

Again, check syslog and/or the job's stderr for the exact message.
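
To see which timestamps to look for in syslog, it can help to pull only
the problem lines out of the MOM log first.  A rough sketch, assuming the
semicolon-separated layout seen in the excerpts above (timestamp, hex
event code, daemon, class, id, message) and assuming the low bit of the
event code marks an error.  That bit meaning is my reading of these logs,
so treat it as an assumption, and the names below are hypothetical:

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: str
    event: int
    daemon: str
    obj_class: str
    obj_id: str
    message: str


def parse_line(line):
    """Split one log line into fields, or return None if it doesn't fit."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    try:
        event = int(parts[1], 16)
    except ValueError:
        return None
    ts, _, daemon, obj_class, obj_id, message = (p.strip() for p in parts)
    return LogRecord(ts, event, daemon, obj_class, obj_id, message)


def error_records(lines, mask=0x0001):
    """Return parsed records whose event code has any bit of mask set."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r.event & mask]
```

Note that this expects each record on a single line, so entries wrapped
by a mail client (like the ones quoted here) need to be rejoined first.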


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California