[torqueusers] Post job file processing error? Jobs not running. :(

Aquarijen aquarijen at gmail.com
Mon Mar 13 18:06:25 MST 2006


Hi All,

I am having quite a time getting things running here, so I was really
hoping someone could help.

I have several submit hosts, a pbs_server, maui and 269 pbs_moms on
different nodes.

A user tries to submit a job.  If you look at the queue right at that
time, you will see the job briefly queued, but it never runs.  I also
get no output or email.  Looking at the server_logs, I see this for a
failed job:

03/13/2006 19:46:05;0100;PBS_Server;Req;;Type AuthenticateUser request
received from 2vt at b05l01.oic.ornl.gov, sock=14
03/13/2006 19:46:05;0100;PBS_Server;Req;;Type QueueJob request
received from 2vt at b05l01.oic.ornl.gov, sock=11
03/13/2006 19:46:05;0100;PBS_Server;Req;;Type JobScript request
received from 2vt at b05l01.oic.ornl.gov, sock=11
03/13/2006 19:46:05;0100;PBS_Server;Req;;Type ReadyToCommit request
received from 2vt at b05l01.oic.ornl.gov, sock=11
03/13/2006 19:46:05;0100;PBS_Server;Req;;Type Commit request received
from 2vt at b05l01.oic.ornl.gov, sock=11
03/13/2006 19:46:05;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;enqueuing
into workq, state 1 hop 1
03/13/2006 19:46:05;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
Queued at request of 2vt at b05l01.oic.ornl.gov, owner =
2vt at b05l01.oic.ornl.gov, job name = parallel-worlds-jen, queue = workq
03/13/2006 19:46:05;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
sent command new
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusNode request
received from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusQueue request
received from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type StatusJob request
received from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
received from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
Modified at request of root at b08l02.oic.ornl.gov
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type RunJob request received
from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
Run at request of root at b08l02.oic.ornl.gov
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type ModifyJob request
received from root at b08l02.oic.ornl.gov, sock=10
03/13/2006 19:46:06;0008;PBS_Server;Job;326.b08l02.oic.ornl.gov;Job
Modified at request of root at b08l02.oic.ornl.gov
03/13/2006 19:46:06;0100;PBS_Server;Req;;Type JobObituary request
received from pbs_mom at b08n079.oic.ornl.gov, sock=14
03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
job file processing error; job 326.b08l02.oic.ornl.gov on host
b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0
03/13/2006 19:46:06;0100;PBS_Server;Job;326.b08l02.oic.ornl.gov;dequeuing
from workq, state COMPLETE
03/13/2006 19:46:06;0040;PBS_Server;Svr;b08l02.oic.ornl.gov;Scheduler
sent command term
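
In case it helps anyone follow a single job through a log like the
one above, here is a minimal Python sketch. It assumes each record
has the six semicolon-separated fields visible in the excerpt and that
wrapped lines belong to the preceding record. The field names are my
own labels for illustration, not official TORQUE terminology.

```python
import re

# A record begins with a timestamp such as "03/13/2006 19:46:06;".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")

# Illustrative labels only; these names are assumptions, not TORQUE's own.
FIELDS = ("timestamp", "event_code", "daemon", "record_class", "object", "message")


def parse_records(text):
    """Rejoin wrapped lines, then split each record into six fields."""
    joined = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line):
            joined.append(line)
        elif joined:
            # Continuation of the previous (wrapped) record.
            joined[-1] += " " + line
    records = []
    for rec in joined:
        # Split at most five times so semicolons inside the message survive.
        parts = rec.split(";", len(FIELDS) - 1)
        if len(parts) == len(FIELDS):
            records.append(dict(zip(FIELDS, (p.strip() for p in parts))))
    return records


def records_for_job(records, job_id):
    """Keep only the records whose object field is the given job id."""
    return [r for r in records if r["object"] == job_id]


if __name__ == "__main__":
    sample = """03/13/2006 19:46:06;0010;PBS_Server;Job;326.b08l02.oic.ornl.gov;Exit_status=-2
03/13/2006 19:46:06;000d;PBS_Server;Job;326.b08l02.oic.ornl.gov;Post
job file processing error; job 326.b08l02.oic.ornl.gov on host
b08n079.oic.ornl.gov/1+b08n079.oic.ornl.gov/0
"""
    for rec in records_for_job(parse_records(sample), "326.b08l02.oic.ornl.gov"):
        print(rec["timestamp"], rec["event_code"], rec["message"])
```

Splitting with a maximum of five separators keeps messages such as
"Post job file processing error; job ..." intact even though they
contain a semicolon themselves.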

The log on the mom says:

03/13/2006 19:08:44;0002;   pbs_mom;Svr;pbs_mom;Is down
03/13/2006 19:08:44;0002;   pbs_mom;Svr;Log;Log closed
03/13/2006 19:25:08;0002;   pbs_mom;Svr;Log;Log opened
03/13/2006 19:25:08;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
03/13/2006 19:25:08;0002;   pbs_mom;Svr;restricted;b08l02
03/13/2006 19:25:08;0002;   pbs_mom;n/a;initialize;independent
03/13/2006 19:25:08;0002;   pbs_mom;Svr;pbs_mom;Is up
03/13/2006 19:25:08;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
03/13/2006 19:26:35;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, after files staged, no retry
03/13/2006 19:26:35;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/13/2006 19:26:35;0008;   pbs_mom;Job;323.b08l02.oic.ornl.gov;Job
Modified at request of PBS_Server at b08l02.oic.ornl.gov
03/13/2006 19:40:14;0002;   pbs_mom;Svr;Log;Log opened
03/13/2006 19:40:14;0001;   pbs_mom;Svr;pbs_mom;Resource temporarily
unavailable (11) in pbs_mom, cannot lock
'/var/spool/pbs/mom_priv/mom.lock' - another mom running
03/13/2006 19:42:19;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, after files staged, no retry
03/13/2006 19:42:19;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/13/2006 19:42:19;0008;   pbs_mom;Job;324.b08l02.oic.ornl.gov;Job
Modified at request of PBS_Server at b08l02.oic.ornl.gov
03/13/2006 19:45:50;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, after files staged, no retry
03/13/2006 19:45:50;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/13/2006 19:45:50;0008;   pbs_mom;Job;325.b08l02.oic.ornl.gov;Job
Modified at request of PBS_Server at b08l02.oic.ornl.gov
03/13/2006 19:46:06;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, after files staged, no retry
03/13/2006 19:46:06;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/13/2006 19:46:06;0008;   pbs_mom;Job;326.b08l02.oic.ornl.gov;Job
Modified at request of PBS_Server at b08l02.oic.ornl.gov
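
The 19:40:14 entry above reports "Resource temporarily unavailable
(11)" while locking mom.lock. As a rough illustration of how that kind
of error arises, the Python sketch below takes an exclusive
non-blocking lock on a file twice. It assumes a Linux-like system and
uses flock() purely for demonstration; whether pbs_mom uses this same
call is an assumption, and the demo.lock path is hypothetical.

```python
import errno
import fcntl
import os
import tempfile


def try_lock(path):
    """Try an exclusive, non-blocking lock.

    Returns (fd, None) on success or (None, errno) if the lock is held.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        os.close(fd)
        return None, exc.errno
    return fd, None


if __name__ == "__main__":
    # Hypothetical lock file in a scratch directory, not the real mom.lock.
    path = os.path.join(tempfile.mkdtemp(), "demo.lock")
    holder, _ = try_lock(path)        # first holder gets the lock
    second, err = try_lock(path)      # second attempt is refused
    print(second, err, errno.errorcode.get(err))
    os.close(holder)                  # closing the descriptor releases the lock
```

The second attempt fails for as long as the first descriptor stays
open, which is consistent with the log's "another mom running" note:
the refusal comes from a live process holding the lock.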

I stopped the pbs_moms, removed the mom.lock files, and started them
again, but I see the same behavior.  The mom log then shows:

03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;caught signal 15:
leaving jobs running, just exiting
03/13/2006 19:55:06;0002;   pbs_mom;Svr;pbs_mom;Is down
03/13/2006 19:55:06;0002;   pbs_mom;Svr;Log;Log closed
03/13/2006 20:00:36;0002;   pbs_mom;Svr;Log;Log opened
03/13/2006 20:00:36;0002;   pbs_mom;Svr;usecp;b08l02:/home /home
03/13/2006 20:00:36;0002;   pbs_mom;Svr;restricted;b08l02
03/13/2006 20:00:36;0002;   pbs_mom;n/a;initialize;independent
03/13/2006 20:00:36;0002;   pbs_mom;Svr;pbs_mom;Is up
03/13/2006 20:00:36;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
03/13/2006 20:01:35;0002;   pbs_mom;Svr;im_eof;End of File from addr
172.16.3.253:15001
03/13/2006 20:01:35;0002;   pbs_mom;n/a;mom_main;hello sent to server b08l02
03/13/2006 20:01:42;0001;   pbs_mom;Svr;is_request;duplicate
connection from 172.16.3.253:1023 - closing original connection
03/13/2006 20:02:38;0001;   pbs_mom;Job;TMomFinalizeJob3;job not
started, Failure job exec failure, after files staged, no retry
03/13/2006 20:02:38;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
03/13/2006 20:02:38;0008;   pbs_mom;Job;328.b08l02.oic.ornl.gov;Job
Modified at request of PBS_Server at b08l02.oic.ornl.gov


What does this mean, other than that the pbs_server isn't running
jobs?  I am at a loss.  I can provide other info if it would help
with troubleshooting.
Very occasionally a job does run correctly, but that is the exception
rather than the norm.


Thank you for any assistance you might be able to give me.

Sincerely,
Jennifer
Admin, ORNL Institutional Cluster

