[torqueusers] Post job file processing error? Jobs not running. :(

Aquarijen aquarijen at gmail.com
Tue Mar 14 12:34:25 MST 2006


What would these stale MOM processes look like?  If I start pbs_mom
and then shut it down, it leaves mom.lock behind, and when I cat
mom.lock, no process with that PID is running on the machine.  There
are also no processes with "mom" or "pbs" in the name.  I shut them all
down and left them down for a few hours.  Upon restart, I have the same
problems (I was thinking maybe something would time out).  Is pbs_mom
supposed to leave the lock file behind after being shut down?
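
For reference, the check described above (read the PID out of mom.lock
and see whether that process exists) can be sketched in Python.  This is
only a rough sketch: the helper names are made up for illustration, the
path is the one from the mom log quoted below, and the lock probe
assumes pbs_mom takes a POSIX fcntl-style lock on the file, which I have
not confirmed.  It needs to run as a user who can open the lock file.

```python
import errno
import fcntl
import os

# Path as it appears in the mom log; adjust for your install.
LOCK_PATH = "/var/spool/pbs/mom_priv/mom.lock"


def read_lock_pid(path):
    """Return the PID stored in the lock file, or None if it holds no integer."""
    with open(path) as f:
        text = f.read().strip()
    return int(text) if text.isdigit() else None


def pid_is_alive(pid):
    """Return True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def lock_is_held(path):
    """Return True if another process holds a POSIX lock on the file.

    Assumption: the daemon uses an fcntl-style lock. The probe takes the
    lock for an instant and releases it again if nobody else holds it.
    """
    with open(path, "r+") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            if e.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        fcntl.lockf(f, fcntl.LOCK_UN)
        return False


if __name__ == "__main__":
    pid = read_lock_pid(LOCK_PATH)
    print("pid in lock file:", pid)
    print("process alive:", pid is not None and pid_is_alive(pid))
    print("lock held:", lock_is_held(LOCK_PATH))
```

If the PID is dead and nothing holds the lock, then under the
assumption above the leftover file by itself would not be what keeps a
new pbs_mom from starting; the "cannot lock" message would only appear
while some live process still holds it.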

Thanks!
Jen

On 3/14/06, Garrick Staples <garrick at usc.edu> wrote:
> Sounds like you are starting new MOM daemons while some stale MOM
> processes are still alive.
>
> On Mon, Mar 13, 2006 at 08:06:25PM -0500, Aquarijen alleged:
> > Hi All,
> >
> > I am having quite a time getting things running here, so I was really
> > hoping someone could help.
> >
> > I have several submit hosts, a pbs_server, maui and 269 pbs_moms on
> > different nodes.
> >
> > A user tries to submit a job.  If you look at the queue right at that
> > time, you will see the job briefly queued, but it never runs.  I also
> > get no output or email.  Looking at the server_logs, I see this for a
> > failed job:
> >
> > 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b05l01.oic.ornl.gov, sock=14
> > 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type QueueJob request
> > received from 2vt at b05l01.oic.ornl.gov, sock=11
> > 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type JobScript request
> > received from 2vt at b05l01.oic.ornl.gov, sock=11
> > 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type ReadyToCommit request
> > received from 2vt at b05l01.oic.ornl.gov, sock=11
> > 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type Commit request received
> > from 2vt at b05l01.oic.ornl.gov, sock=11
> > 03/13/2006 19:46:05;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;enqueuing
> > into workq, state 1 hop 1
> > 03/13/2006 19:46:05;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> > Queued at request of 2vt at b05l01.oic.ornl.gov, owner =
> > 2vt at b05l01.oic.ornl.gov, job name = parallel-worlds-jen, queue = workq
> > 03/13/2006 19:46:05;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> > sent command new
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusNode request
> > received from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusQueue request
> > received from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusJob request
> > received from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
> > received from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> > Modified at request of root at b08l02.oic.ornl.gov
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type RunJob request received
> > from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> > Run at request of root at b08l02.oic.ornl.gov
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
> > received from root at b08l02.oic.ornl.gov, sock=10
> > 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> > Modified at request of root at b08l02.oic.ornl.gov
> > 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type JobObituary request
> > received from pbs_mom at b08n079.oic.ornl.gov, sock=14
> > 03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
> > 03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
> > job file processing error; job 326.b08l02.oic.ornl.gov on host
> > b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0
> > 03/13/2006 19:46:06;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;dequeuing
> > from workq, state COMPLETE
> > 03/13/2006 19:46:06;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> > sent command term
> >
> > The log on the mom says:
> >
> > 03/13/2006 19:08:44;0002;   pbs_mom;Svr;pbs_mom;Is down
> > 03/13/2006 19:08:44;0002;   pbs_mom;Svr;Log;Log closed
> > 03/13/2006 19:25:08;0002;   pbs_mom;Svr;Log;Log opened
> > 03/13/2006 19:25:08;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> > 03/13/2006 19:25:08;0002;   pbs_mom;Svr;restricted;b08l02
> > 03/13/2006 19:25:08;0002;   pbs_mom;n/a;initialize;independent
> > 03/13/2006 19:25:08;0002;   pbs_mom;Svr;pbs_mom;Is up
> > 03/13/2006 19:25:08;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > 03/13/2006 19:26:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, after files staged, no retry
> > 03/13/2006 19:26:35;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 03/13/2006 19:26:35;0008;   pbs_mom;Job;323.b08l02.oic.ornl.gov;Job
> > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> > 03/13/2006 19:40:14;0002;   pbs_mom;Svr;Log;Log opened
> > 03/13/2006 19:40:14;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily
> > unavailable (11) in pbs_mom, cannot lock
> > '/var/spool/pbs/mom_priv/mom.lock' - another mom running
> > 03/13/2006 19:42:19;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, after files staged, no retry
> > 03/13/2006 19:42:19;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 03/13/2006 19:42:19;0008;   pbs_mom;Job;324.b08l02.oic.ornl.gov;Job
> > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> > 03/13/2006 19:45:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, after files staged, no retry
> > 03/13/2006 19:45:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 03/13/2006 19:45:50;0008;   pbs_mom;Job;325.b08l02.oic.ornl.gov;Job
> > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> > 03/13/2006 19:46:06;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, after files staged, no retry
> > 03/13/2006 19:46:06;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 03/13/2006 19:46:06;0008;   pbs_mom;Job;326.b08l02.oic.ornl.gov;Job
> > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> >
> > I have stopped the pbs_moms and removed the mom.locks but I see the
> > same behavior and I then see this in the logs on the mom:
> >
> > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;caught signal 15:
> > leaving jobs running, just exiting
> > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;Is down
> > 03/13/2006 19:55:06;0002;   pbs_mom;Svr;Log;Log closed
> > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;Log;Log opened
> > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;restricted;b08l02
> > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;initialize;independent
> > 03/13/2006 20:00:36;0002;   pbs_mom;Svr;pbs_mom;Is up
> > 03/13/2006 20:00:36;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > 03/13/2006 20:01:35;0002;   pbs_mom;Svr;im_eof;End of File from addr
> > 172.16.3.253:15001
> > 03/13/2006 20:01:35;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> > 03/13/2006 20:01:42;0001;   pbs_mom;Svr;is_request;duplicate
> > connection from 172.16.3.253:1023 - closing original connection
> > 03/13/2006 20:02:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> > started, Failure job exec failure, after files staged, no retry
> > 03/13/2006 20:02:38;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> > 03/13/2006 20:02:38;0008;   pbs_mom;Job;328.b08l02.oic.ornl.gov;Job
> > Modified at request of PBS_Server at b08l02.oic.ornl.gov
> >
> >
> > What does this mean, other than that the pbs server isn't running
> > jobs?  I am at a loss.  I can provide other info if there is
> > something that might help with troubleshooting.
> > Very occasionally, it will run a job correctly.  This is the
> > exception rather than the norm, however.
> >
> >
> > Thank you for any assistance you might be able to give me.
> >
> > Sincerely,
> > Jennifer
> > Admin, ORNL Institutional Cluster
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
>
> --
> Garrick Staples, Linux/HPCC Administrator
> University of Southern California