[torqueusers] Post job file processing error? Jobs not running. :(

Aquarijen aquarijen at gmail.com
Wed Mar 15 07:42:18 MST 2006


Thank you Garrick. I'll check syslog.  That's a good lead.  I
appreciate the help and the explanations.
-Jen

On 3/14/06, Garrick Staples <garrick at usc.edu> wrote:
> On Tue, Mar 14, 2006 at 02:34:25PM -0500, Aquarijen alleged:
> > What would these stale mom processes look like?  If I start the
> > pbs_mom, and then shut it down, it leaves the mom.lock out there and
> > if I cat the mom.lock, there is no process with that number running on
> > the machine.  There are also no processes with "mom" or "pbs" in the
> > name.  I shut them all down and left them down for a few hours.  Upon
> > restart, I have the same problems (was thinking maybe something would
> > time out?).  Is the pbs_mom supposed to leave the lock out there after
> > being shut down?
>
> They would look like "pbs_mom", and since you looked, that obviously
> isn't the problem.  Yes, pbs_mom leaves the lockfile behind after it
> exits, but that doesn't matter.
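
For anyone repeating this check, the manual test described above (read the number in mom.lock, then look for a process with that PID) can be scripted. This is only a minimal sketch and not part of TORQUE: the helper names are made up for illustration, and the lock path is the one that appears in the MOM log in this thread, so adjust it for your install.

```python
import os

# Path as it appears in the MOM log quoted in this thread; adjust as needed.
LOCKFILE = "/var/spool/pbs/mom_priv/mom.lock"


def pid_from_lockfile(path):
    """Return the PID stored in the lock file, or None if it cannot be read."""
    try:
        with open(path) as fh:
            fields = fh.read().split()
    except OSError:
        return None
    if not fields or not fields[0].isdigit():
        return None
    pid = int(fields[0])
    return pid if pid > 0 else None


def pid_is_running(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the target
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    pid = pid_from_lockfile(LOCKFILE)
    if pid is None:
        print("no readable PID in", LOCKFILE)
    elif pid_is_running(pid):
        print("PID", pid, "is still running")
    else:
        print("PID", pid, "is not running; the lock file is just a leftover")
```

A leftover file whose PID is not running matches what is described above: the file stays behind, but nothing is holding it.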
>
> You didn't mention it, but I'm assuming you are on 2.0.0p8.
>
>
> > > > 03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
> > > > 03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
> > > > job file processing error; job 326.b08l02.oic.ornl.gov on host
> > > > b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0
>
> This means exactly what it says: MOM reports that the job start failed
> after any files were staged in.
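
When digging through many of these server and MOM log entries, it can help to split them into fields. The sketch below assumes the semicolon-separated layout visible in the excerpts in this thread (timestamp, hex event code, daemon, object class, object id, message). The field names and the `parse_pbs_log_line` helper are my own labels, not TORQUE's, and the lines must first be un-wrapped, since the mail client folded them here.

```python
from datetime import datetime
from typing import NamedTuple


class PbsLogRecord(NamedTuple):
    when: datetime
    event: int        # the hex event code, e.g. 0x000d
    daemon: str       # e.g. "PBS_Server" or "pbs_mom"
    obj_class: str    # e.g. "Job" or "Svr"
    obj_id: str       # job id, host name, or function name
    message: str


def parse_pbs_log_line(line):
    """Split one un-wrapped log line into its six fields."""
    # Split at most 5 times so semicolons inside the message are preserved.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        raise ValueError("not a six-field log line: %r" % line)
    stamp, event, daemon, obj_class, obj_id, message = parts
    return PbsLogRecord(
        when=datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        event=int(event, 16),
        daemon=daemon.strip(),  # MOM logs pad this field with spaces
        obj_class=obj_class,
        obj_id=obj_id,
        message=message,
    )
```

With records in this form it is easy to filter on one job id and line up what the server and the MOM each logged for that job.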
>
>
> > > > 03/13/2006 19:46:06;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;dequeuing
> > > > from workq, state COMPLETE
> > > > 03/13/2006 19:46:06;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> > > > sent command term
> > > >
> > > > The log on the mom says:
> > > >
> > > > 03/13/2006 19:25:08;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > > 03/13/2006 19:26:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > > > started, Failure job exec failure, after files staged, no retry
>
> This means the MOM child process that was to become the job hit an error.
> If the failure happened early in the process, the errors go to syslog.
> If it happened later, they go to the job's stderr.
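
To follow up on the syslog suggestion, a small filter can pull out only the lines that mention pbs_mom. This is a rough sketch under assumptions: the syslog location is not given in this thread (it is commonly /var/log/messages or /var/log/syslog, depending on the distribution), and matching on the plain string "pbs_mom" is just a simple heuristic.

```python
def daemon_lines(lines, tag="pbs_mom"):
    """Return the lines (without trailing newlines) that mention the tag."""
    return [ln.rstrip("\n") for ln in lines if tag in ln]


def scan_file(path, tag="pbs_mom"):
    """Read a syslog file and return only the lines that mention the tag."""
    # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
    with open(path, errors="replace") as fh:
        return daemon_lines(fh, tag)


# Example (the path is an assumption; use your system's syslog file):
#     for line in scan_file("/var/log/messages"):
#         print(line)
```

Comparing the timestamps of the matching lines with the "job not started" entries in the MOM log should narrow down which message belongs to the failed job start.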
>
>
> > > > 03/13/2006 19:26:35;0008;   pbs_mom;Job;323.b08l02.oic.ornl.gov;Job
> > > > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> > > > 03/13/2006 19:40:14;0002;   pbs_mom;Svr;Log;Log opened
> > > > 03/13/2006 19:40:14;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily
> > > > unavailable (11) in pbs_mom, cannot lock
> > > > '/var/spool/pbs/mom_priv/mom.lock' - another mom running
>
> This worries me, and is why I originally thought stale MOM processes
> were somehow hanging around.  Did you just mistakenly start a new MOM
> without killing the old one?
>
>
> > > > I have stopped the pbs_moms and removed the mom.locks but I see the
> > > > same behavior and I then see this in the logs on the mom:
>
> Don't worry about removing the lock files.  If MOM has exited, the new
> MOM will get a new lock.
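
The reason a leftover lock file is harmless is that the lock is held by the running process, not by the file's existence. The sketch below illustrates that behaviour with an advisory lock taken through Python's `fcntl.flock`. The thread does not say which locking call pbs_mom actually uses, so treat this as an illustration of the general mechanism only; `try_lock` is a hypothetical helper. On Linux, errno 11 is EAGAIN ("Resource temporarily unavailable"), the same text as in the MOM log above.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on path.

    Returns an open file descriptor on success (keep it open to keep the
    lock), or None if some other holder already has the file locked.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: someone else holds the lock right now.
        os.close(fd)
        return None
    return fd
```

While the first descriptor is open, a second `try_lock` on the same path returns None. Once the holder closes the descriptor or exits, the file is still on disk but the next `try_lock` succeeds.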
>
>
> > > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;caught signal 15:
> > > > leaving jobs running, just exiting
> > > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;Is down
> > > > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;Log;Log closed
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;Log;Log opened
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;restricted;b08l02
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;initialize;independent
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;pbs_mom;Is up
>
> No lockfile error this time, and the previous shutdown is recorded.
> This looks normal.
>
>
> > > > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > > 03/13/2006 20:01:35;0002;   pbs_mom;Svr;im_eof;End of File from addr
> > > > 172.16.3.253:15001
> > > > 03/13/2006 20:01:35;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > > > 03/13/2006 20:01:42;0001;   pbs_mom;Svr;is_request;duplicate
> > > > connection from 172.16.3.253:1023 - closing original connection
>
> Don't worry about this stuff.  That's just the new MOM getting synced
> with pbs_server.
>
>
> > > > 03/13/2006 20:02:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > > > started, Failure job exec failure, after files staged, no retry
>
> Again, check syslog and/or the job's stderr for the exact message.
>
>
> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California

