[torqueusers] Post job file processing error? Jobs not running. :(

Garrick Staples garrick at usc.edu
Tue Mar 14 02:29:47 MST 2006


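Sounds like you are starting new MOM daemons while some stale MOM
processes are still alive.  The 19:40:14 entry in your mom log
("cannot lock '/var/spool/pbs/mom_priv/mom.lock' - another mom running")
points that way.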
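
For anyone who wants to check a mom log for this symptom across many
nodes, here is a minimal Python sketch. It assumes the log records look
like the ones quoted below: six semicolon-separated fields, with long
messages possibly wrapped onto following lines. The field names
(event_code, event_class, and so on) and the function names are my own
labels for illustration, not anything defined by TORQUE.

```python
from typing import Iterable, List, NamedTuple


class MomLogEntry(NamedTuple):
    # Field names are illustrative labels, not official TORQUE terms.
    timestamp: str
    event_code: str
    daemon: str
    event_class: str
    name: str
    message: str


def parse_mom_log(lines: Iterable[str]) -> List[MomLogEntry]:
    """Parse semicolon-separated records, joining wrapped continuation lines."""
    entries: List[MomLogEntry] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";", 5)
        # A new record has six fields and starts with a date like 03/13/2006.
        if len(fields) == 6 and fields[0][:2].isdigit() and "/" in fields[0]:
            entries.append(MomLogEntry(*(f.strip() for f in fields)))
        elif entries:
            # Otherwise treat the line as the wrapped tail of the last message.
            last = entries[-1]
            entries[-1] = last._replace(
                message=(last.message + " " + line).strip()
            )
    return entries


def lock_failures(entries: Iterable[MomLogEntry]) -> List[MomLogEntry]:
    """Return entries where a mom could not take its lock file."""
    return [
        e for e in entries
        if "cannot lock" in e.message and "another mom running" in e.message
    ]
```

Feeding it the lines of a mom log file and printing the timestamps of
lock_failures(...) gives the times at which a second mom was started on
top of a running one.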

On Mon, Mar 13, 2006 at 08:06:25PM -0500, Aquarijen alleged:
> Hi All,
> 
> I am having quite a time getting things running here, so I was really
> hoping someone could help.
> 
> I have several submit hosts, a pbs_server, maui and 269 pbs_moms on
> different nodes.
> 
> A user tries to submit a job.  If you look at the queue right at that
> time, you will see the job briefly queued, but it never runs.  I also
> get no output or email.  Looking at the server_logs, I see this on a
> failed job:
> 
> 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from 2vt at b05l01.oic.ornl.gov, sock=14
> 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type QueueJob request
> received from 2vt at b05l01.oic.ornl.gov, sock=11
> 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type JobScript request
> received from 2vt at b05l01.oic.ornl.gov, sock=11
> 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type ReadyToCommit request
> received from 2vt at b05l01.oic.ornl.gov, sock=11
> 03/13/2006 19:46:05;0100;PBS_Server;Req;;Type Commit request received
> from 2vt at b05l01.oic.ornl.gov, sock=11
> 03/13/2006 19:46:05;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;enqueuing
> into workq, state 1 hop 1
> 03/13/2006 19:46:05;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> Queued at request of 2vt at b05l01.oic.ornl.gov, owner =
> 2vt at b05l01.oic.ornl.gov, job name = parallel-worlds-jen, queue = workq
> 03/13/2006 19:46:05;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> sent command new
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusNode request
> received from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusQueue request
> received from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusJob request
> received from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
> received from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> Modified at request of root at b08l02.oic.ornl.gov
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type RunJob request received
> from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> Run at request of root at b08l02.oic.ornl.gov
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
> received from root at b08l02.oic.ornl.gov, sock=10
> 03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
> Modified at request of root at b08l02.oic.ornl.gov
> 03/13/2006 19:46:06;0100;PBS_Server;Req;;Type JobObituary request
> received from pbs_mom at b08n079.oic.ornl.gov, sock=14
> 03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
> 03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
> job file processing error; job 326.b08l02.oic.ornl.gov on host
> b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0
> 03/13/2006 19:46:06;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;dequeuing
> from workq, state COMPLETE
> 03/13/2006 19:46:06;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
> sent command term
> 
> The log on the mom says:
> 
> 03/13/2006 19:08:44;0002;   pbs_mom;Svr;pbs_mom;Is down
> 03/13/2006 19:08:44;0002;   pbs_mom;Svr;Log;Log closed
> 03/13/2006 19:25:08;0002;   pbs_mom;Svr;Log;Log opened
> 03/13/2006 19:25:08;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> 03/13/2006 19:25:08;0002;   pbs_mom;Svr;restricted;b08l02
> 03/13/2006 19:25:08;0002;   pbs_mom;n/a;initialize;independent
> 03/13/2006 19:25:08;0002;   pbs_mom;Svr;pbs_mom;Is up
> 03/13/2006 19:25:08;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> 03/13/2006 19:26:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 03/13/2006 19:26:35;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 03/13/2006 19:26:35;0008;   pbs_mom;Job;323.b08l02.oic.ornl.gov;Job
> Modified at request of PBS_Server at b08l02.oic.ornl.gov
> 03/13/2006 19:40:14;0002;   pbs_mom;Svr;Log;Log opened
> 03/13/2006 19:40:14;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily
> unavailable (11) in pbs_mom, cannot lock
> '/var/spool/pbs/mom_priv/mom.lock' - another mom running
> 03/13/2006 19:42:19;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 03/13/2006 19:42:19;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 03/13/2006 19:42:19;0008;   pbs_mom;Job;324.b08l02.oic.ornl.gov;Job
> Modified at request of PBS_Server at b08l02.oic.ornl.gov
> 03/13/2006 19:45:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 03/13/2006 19:45:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 03/13/2006 19:45:50;0008;   pbs_mom;Job;325.b08l02.oic.ornl.gov;Job
> Modified at request of PBS_Server at b08l02.oic.ornl.gov
> 03/13/2006 19:46:06;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 03/13/2006 19:46:06;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 03/13/2006 19:46:06;0008;   pbs_mom;Job;326.b08l02.oic.ornl.gov;Job
> Modified at request of PBS_Server at b08l02.oic.ornl.gov
> 
> I have stopped the pbs_moms and removed the mom.locks but I see the
> same behavior and I then see this in the logs on the mom:
> 
> 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;caught signal 15:
> leaving jobs running, just exiting
> 03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;Is down
> 03/13/2006 19:55:06;0002;   pbs_mom;Svr;Log;Log closed
> 03/13/2006 20:00:36;0002;   pbs_mom;Svr;Log;Log opened
> 03/13/2006 20:00:36;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
> 03/13/2006 20:00:36;0002;   pbs_mom;Svr;restricted;b08l02
> 03/13/2006 20:00:36;0002;   pbs_mom;n/a;initialize;independent
> 03/13/2006 20:00:36;0002;   pbs_mom;Svr;pbs_mom;Is up
> 03/13/2006 20:00:36;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> 03/13/2006 20:01:35;0002;   pbs_mom;Svr;im_eof;End of File from addr
> 172.16.3.253:15001
> 03/13/2006 20:01:35;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
> 03/13/2006 20:01:42;0001;   pbs_mom;Svr;is_request;duplicate
> connection from 172.16.3.253:1023 - closing original connection
> 03/13/2006 20:02:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 03/13/2006 20:02:38;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 03/13/2006 20:02:38;0008;   pbs_mom;Job;328.b08l02.oic.ornl.gov;Job
> Modified at request of PBS_Server at b08l02.oic.ornl.gov
> 
> 
> What does this mean other than the pbs server isn't running jobs?  I
> am at a loss.  I can provide other info if there is something that
> might help with troubleshooting.
> Very occasionally it will run a job correctly, but that is the
> exception rather than the norm.
> 
> 
> Thank you for any assistance you might be able to give me.
> 
> Sincerely,
> Jennifer
> Admin, ORNL Institutional Cluster
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
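
To see whether a node still has an old MOM process alive before starting
a new one, something like the following Python sketch may help. It is
Linux-only and assumes /proc/<pid>/comm is available; the function name
find_processes and the proc_root parameter are hypothetical, chosen just
for this example.

```python
import os
from typing import List


def find_processes(name: str, proc_root: str = "/proc") -> List[int]:
    """Return sorted PIDs whose <proc_root>/<pid>/comm equals `name`."""
    pids: List[int] = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited in the meantime, or is not readable.
            continue
    return sorted(pids)
```

Calling find_processes("pbs_mom") on a compute node returns the PIDs of
every process with that name. If any turn up after you believe the mom
has been stopped, those are worth investigating before starting a new
one.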

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California